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Introduction 


Rice University's ELEC 301 course introduces the notion of signal 
processing, and deals with signals, systems, and transforms, from their 
theoretical mathematical foundations to practical implementation in 
circuits, computer software and hardware. 


The following is a collection of the seven group projects from the Fall 2014 
offering of the course. 


Background and Motivation 
First section 


Noise has long been an enemy of human beings. It can disrupt 
communications, impair concentration, and even damage the auditory 
system if it is sufficiently loud. The traditional noise cancellation idea 
involves energy absorption: by strategically placing damping material close 
to the noise source, much of the sound energy can be converted to heat, 
therefore reducing the volume of the sound. The soft, porous, multi-layer 
material covering dance room and studio room walls can damp most of the 
energy of noise. These methods are referred to as “passive noise 
cancellation”, since they only passively quarantine, or absorb, noise. The 
result can be ideal, since most foam can reduce the loudness by up to 35 dB. 
The weakness of this type of noise cancellation, however, is its scope. All 
sounds in the absorptive frequency range are reduced; if useful sound 
information is present, this information is filtered out as well. Passive noise 
cancellation also has a higher cost and requires more complicated 
instrument installation. 


To deal with this problem, and to better cancel noise based on its 
characteristics (as opposed to general cancellation), people invented the 
idea of active noise control (ANC). Unlike passive noise cancellation, ANC 
involves playing “anti-noise”. Bose promotes this idea to the public, thus 
making ANC one of the hottest, most popular topics about “sound” these 
days. 


First Experiment 
Second Section 


The mechanism is simple. Let’s first experience this in a more direct way, 
and use another pair of sense organs on your body: your eyes. This is a 2D 
cancellation. 
Lenna Images 


Original image 


Inverted image 


The result of combining the two pictures 


This is the famous standard DSP processing image, Lenna. Figure 1 is the 
original one, and Figure 2 is its inverted image, which means every pixel is 
the complementary color of the original. 


Our eyes can perfectly detect the difference between these two images. 
Now imagine setting the transparency of both images to 50%, and put one 
of them on the top of the other one. It feels like seeing through two think 
films. As you can see in Figure 3, it finally results in a complete blank 
sheet! 


Second Experiment 
Third Section 


After this first experiment, it will be easier to think about 1D sound. Now 
check the sound waveforms below (recorded on a Sunday morning, in 
Brochstein Pavilion at Rice University, length 10 seconds). 


Recorded sound wave 


It is easy to see that these two sound waves are exactly the opposite of each 
other by carefully checking some specific spikes. 


Now if you play the first OR the second piece of data, you would hear some 
gibberish of background music, people talking, and spoon and plates 
clinking on a typical Sunday morning. That being said, you should realize 
that your ears are not as sensitive as your eyes in some way; they cannot 
detect phase difference monotonically! The second piece of signal, though 
being the negative version of the first piece, sounds identical to humans. 
Now if you play the two signals at the same time, you would get absolutely 
nothing, like what happened to your eyes before: silence! 


To experiment with this more convincingly, try it out with your own 
speaker set. First make sure you have two audio channels. Open MATLAB, 


and type in: 


recObj = audiorecorder; 

recordblocking(recObj,5); % try to make some noise 
noiseData = getaudiodata(recObj); 

antiNoiseData = - noiseData; % invert! 

double channel = [noiseData,noiseData]; 
sound(double_channel, 8000); 


Now you can hear the piece you’ve just recorded. To examine our 
cancelling theory, type in this: 


double channel = [noiseData, -noiseData]; 
sound(double_channel, 8000); 


Note that if you are listening to this by your earphones, you would still hear 
the sound, but notice a subtle difference: your brain automatically calculates 
the position of the sound source by combining the two sound signals from 
your left and right ears. If you were playing them with your loudspeakers, 
however, you would very likely get nothing. Try to place the diaphragms of 
the speakers right towards each other very closely, and the sound waves will 
cancel out. 


This is the basic mechanism of ANC. It is widely applied in most noise 
cancelling systems, such as headphones (to enable one to hear music 
better), on cars (to reduce the noise of the engine and tires), and in cell 
phones (to reduce background noise in conversations). While this sounds 
useful and easy, the real world is not always so straightforward. Detecting 
the phase of a sound wave is hard. Usually a Phase Locked Loop is used to 
determine the phase of input noise and to generate a 180-degree phase- 
shifted "anti-noise" as described above. This algorithm requires a steady 
noise source (resembling a sinusoidal wave) and is therefore unable to filter 
out more complex noise such as the background noise of a crowd. Even if 
the input were a pure sine tone, the reduction would be no more than 10dB. 
The greatest restriction is the geometry of environment settings. The 
location of the microphone used to pick up the sound signal and the location 
of output loudspeakers determine the sound quality of the noise 
cancellation. As you may have noticed, when you put your two speakers 


right in front of each other with no space between them, the cancellation 
works best; as the speakers are moved away from each other, you can hear 
louder and louder noise. The sound waves tend to reflect off of the wall, or 
any other surface, sometimes reinforcing each other and resulting in 
unpredictable behavior. Another limitation is that high frequency noise has 
really short wavelengths, which makes the distance between ears non- 
negligible, so that the cancellation signal for the left ear can actually 
reinforce noise for right ear. 


Adaptive Filter System 
Fourth Section 


Introduction to LMS Adaptive Filter 


Solving the lower frequency noise speaker-microphone topological problem 
requires implementation of adaptive filters. An adaptive filter has two major 
components: an FIR filter and an Adaptive Algorithm. The FIR filter’s 
transfer function is controlled by variable parameters, computed by the 
adaptive algorithm, that optimize output. The algorithm for this project is 
Least Mean Squares (LMS). The whole process can be described by the 
diagram below: 


Input to the system 
x(n)}=V(n)+S(n) 


Eestimate(n) =V(n) 


Noise Source Function: Update FIR 
coefficient w(n) 


System Block diagram 


The FIR filter is constructed with N coefficients, w, and delays to generate 
the desired output. With linear combinations of these delayed signals, the 
filter can produce excellent results. 


aia of Zz Nt 


(/)w(n) (/)w,(n) (/)w,(n) 


Structure of an FIR filter 


LMS Algorithm Diagram 


Let’s first go through the system and get to know the basic idea. The system 
has two inputs and one output. The two inputs are the pure noise Y(n) and 
the input recorded in the environment (human voice V(n) plus noise S(n)). 
The output is our desired signal (human voice without noise or with little 
noise) Eestimate(n). It’s an estimated signal of pure human voice; it is also 
called the error signal in the LMS algorithm. 


The system consists of a summer and an Adaptive Filter. The adaptive filter 
consists of a FIR filter (shown in the figure above) and an adaptive 
algorithm. The noise cancellation happens in the summer. We will see its 


details in the next paragraph. The feedback loop is the most important part 
of this system. It takes the output of the system Eestimate(n) (also the 
output of the summer) and feeds it to the adaptive algorithm. Then the 
adaptive algorithm updates the coefficients w(n) of the FIR filter. The 
output of the adaptive filter is the signal Yestimate(n). Feed this output to 
the summer. By doing so, we improve the approximation of noise 
Yestimate(n). In the summer we do the cancellation again and get a new 
output Eestimate(n). This completes one loop. After each iteration we 
compute the power of Eestimate(n), the number we want to minimize using 
the LMS algorithm. In short, the main function of the adaptive filter is to 
minimize the power of the error signal Eestimate(n) by adjusting the 
coefficients of the FIR filter via the LMS algorithm. 


Now let’s zoom into the details of the system. It runs in following five 
steps: 


1. Set up a initial value for w,,(0) 

2. Design the output of the FIR filter: Yestimate(2)= YM m-0Wm(n)Y (n-m), 
where M is the order of the FIR filter. 

3. Get Eestimate(1), where Eestimate(M)=X(N)-Y estimate(M)- This step happens 
in the Addition component of the system. 

4. Use the LMS algorithm to update the coefficients w(n) of the FIR 
filter. W,,(0+1)= Wy(D)+2Ugestimate()*s(n-m), for m in [0,M]. The 
power we want to minimize is: E[Eostimate(N)*]=E[V(n)*]+(E[S(n)- 
Yestimate(1)])* Notice that we introduced a new variable u here. We will 
explain it later. 

5. Change n to n+1, and repeat the four steps above 


Let’s call u in step 4 the “convergence factor”. We want 0<u<1/(kpax), 
where kyax is the largest eigenvalue of the auto-correlation matrix R,,. The 
reason for this inequality will be explained in the next paragraph. Now the 
difficult part is finding the value of u. Note that the statistical information of 
x(n) is unknown in practical use of this system, since we do not have 
predictions about either the noise or the speaker’s voice signal. This means 
we do not know what R,, is. Therefore, under a certain environment, we 
need to try different values to get the best u. If u is very small, it takes the 
system a long time to converge. If u is very big, the system converges fast 


with the cost of quality of the noise cancellation. Thus, trying values of u is 
crucial to this project. 


Now let’s see where 0<u<1/k,,,, comes from. This whole system is actually 
a combination of Weiner’s Filter and an adaptive algorithm. We will skip 
the mathematical computation of Weiner’s Filter here, since it is not our 
main focus in the project. When this system’s iteration goes to infinity, the 
expectation of the vector w,,(n) will converge to the Weiner’s Filter 
solution. However, it has a restriction that abs(1-2u*k,,,,)<1 (absolute value 
of diagonal elements is less than 1), which gives us 0<u<1/k,,,, directly. 


Third Experiment (Result of Testing the Adaptive Filter 
System) 


Now let’s discuss the results of our testing. We passed a recording of a 
human voice into this system, together with constant-frequency sine wave 
noise. Suppose the frequency of this sine wave is Wo (we are using a capital 
W here because we are using lowercase w to represent the coefficients of 
the FIR filter). 


Since the pure noise signal is a constant-frequency sine wave with 
frequency Wo, we know that Y(n)=exp(jWon). Thus, 


1. wy, (n+1)=w,,(0)+2uUEectimate()eXPp(-JWo(n-m)), , where m=0,1,2,...,M 

2. Since w,,(n)=y,,(n)exp(-jWo(n-m)), by plugging Equation 2 into 
Equation 1, we get: 

3. Ym(n+1)exp(-]Wo)=¥m()+2uUE estimate(M) 

4. Taking the z-transform of Equation 3, we get z*y,,(z)exp(- 
jWo)=Ym(Z)t+2uE(z), where E(z) is the Z-transform of Egstimate(M) 

5. Then, y(z)=E(z)*2uexp(jWo)/(z-exp(jWo)) 

6. Since Sestimate(2)= > m=0Wm(n)exp(jWo(n-m)), 

7. by plugging Equation 2 into Equation 6, we get Sastimate(1)= 
ye n-o¥mt) 

8. Taking the z-transform of both sides of Equation 7 and using Equation 
5, we get S(z)=E(z)*2u(M+1)exp(jWo)/(z-exp(jWo)), where S(z) is the 
z-transform Sestimate(N). 

9. Since Festimate(1)=X(N)-Sestimate(M), 


10. Taking the z-transform of Equation 9, we get E(z)=X(z)- 
E(z)*2u(M+1)exp(jWo)/(z-exp(jWo)) 

11. Therefore, the transfer function H(z)=E(z)/X(z)=[z-exp(jWo)]/[z- 
exp(JWo(1-2u(M+1)))] 


From this transfer function (Equation 11), we get a zero at z=exp(jW); this 
means the system is canceling the sine wave with frequency Wo from the 
system input x(n). This is the exact goal of this system. Also, notice that if 
we choose a very small “convergence factor” u, all the poles of this system 
would be inside the unit circle of the Z plane, which suggests that the 
system is BIBO stable. 


After we input human voice and a constant-frequency sine wave through 
this system, the result is pretty good. We can hear the human voice clearly 
with the filter on. 


Advantages of Adaptive Filter System with LMS Algorithm 


In practice, we don’t know statistical information of the input such as the 
auto-correlation function R,,. Adaptive Filters and LMS algorithm do not 
need this information to accomplish noise cancellation. This is a big 
advantage over the Weiner Filter and Kalman Filter. 


Furthermore, the LMS algorithm requires fewer mathematical 
complications than other algorithms (such as RLS), and is also easier to be 
implemented in practice. 


FIR filters have linear phase, so the signal won’t be distorted, which is a big 
advantage over IIR filters. Moreover, FIR filters have only zeros, and are 
always BIBO stable. Stability is crucial for signal processing. 


Applications and Future Improvements 
Fifth section 


This noise cancellation method is useful in many real applications. The first 
usage would be crowd control. Imagine calling out something in the middle 
of a huge noisy crowd. Even with the help of microphones and amplified 
loudspeakers, it is hard to make oneself heard since the crowd noise would 
be picked up by the microphone as well. An easy fix is to implement this 
noise cancellation system, and with some simple DSP tricks, raise the sound 
frequency of the speaker by a few decades (just above the noise frequency 
baseline), and everyone could hear a clean, loud tone. Anything with voice 
communication could implement this system, and the effect would just 
sound like the narrative is mixed later to the original tape. 


The main problem of this system is that currently it is not truly real-time 
under the MATLAB environment. We are recording voices for a certain 
length of time (5 seconds, for example) and feeding this recording into our 
system. We tried saying 1 to 10 to the system, and when we listened to our 
output, we can only hear 1 to 6; information between 7 and 10 is chopped 
off. That means the tradeoff between the speed of convergence and the 
quality of noise cancellation needs to be considered more carefully in future 
works. Furthermore, the system does not work well for noise with “high” 
frequencies above 2kHz. Figuring out how to make this system work in a 
larger range of frequencies is another issue we need to consider in the 
future. 


Conclusion 
Sixth Section 


This ANC system has many advantages. It is not constrained to the 
geometry of the environment, since the microphone picks up the sound and 
builds an adaptive filter based on the input noise. If the location is 
changing, it takes an extra loop to update the secondary training inside the 
main function. Furthermore, even sound that is louder than the desired 
signal volume could be filtered out as long as it’s a clean tone. 


The biggest challenge is the design of the FIR filter and the choice of audio 
recording. We played around with different number of orders, sampling 
rates, and buffer sizes to generate the optimal sound from MATLAB. We 
successfully filtered out noise, and it is pretty amazing seeing it work. 
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Background 


Introduction 


Object recognition is one of the hottest areas in computer vision. Finding 
and identifying objects in an image is fundamental to facial recognition, 
object tracking in videos, object matching and many other applications. To 
achieve object recognition (from images), one first must be able to extract 
distinct features of the image, then correctly identify the object of interest 
from these key features. Our group explored three different algorithms. The 
first, Scale-Invariant Feature Transform, was a method to obtain the distinct 
features of the image while the other two, Hough Transform and Moment of 
Inertia, were pattern recognizing algorithms. 


Motivation 


We wanted to simulate a user being given a sheet of paper with squares and 
other shapes on it. The user would tell us how many squares appeared in the 
image. Because we were highly interested in the field of object recognition 
we decided to implement two systems and compare the results. Both 
systems used Scale-Invariant Feature Transform to obtain feature points. 
The first system then passed those results into the Hough Transform. The 
second system used the feature points to implement a formula involving the 
properties of the Moment of Inertia. 


System 2- 

SIFT then 

Moment of 
Inertia 


System 1- 
SIFT then 
Hough 
Transform 


Flow of the system design. 


Our square 
template. 


An example test image. 


Implementation of Scale-Invariant Feature Transform 


Introduction 


The Scale- Invariant Feature Transform is an algorithm in computer vision 
to detect and describe points of interest in an image. 


Approach 


According to Prof. Dowe’s paper on Distinctive Image Features from Scale- 
Invariant Keypoints(2004), there are four stages to SIFT: 


1. “Scale-space extrema detection”- Searches the entire image for 
candidate interest points 

2. “Keypoint localization”- Calculate the location and scale for each 
candidate, remove candidates that are not stable 

3. “Orientation assignment”- Assign each key point with one or more 
orientations that are calculated based on the gradient direction at that 
key point location in the image 

4. “Keypoint descriptor”- For each key point, calculate the gradient in its 
surrounding area. This allows the transform to be distortion resistant. 
The above approach “transforms image data into scale-invariant 
coordinates relative to local features”. 


Our project seeks to achieve scale, rotation and translation resistance. 
However due to time-constraint, we did not implement stage 4 “Keypoint 
descriptor” of SIFT. 


Implementation 


Stage 1 


Apply Gaussian filters of different scales to the image. By using different 
scales the Gaussian filters would have different variances. Due to the 
inherent properties of Gaussian filters, this would “smooth” out the images, 
removing finer details of the image. At different scales, the details of the 


image that are insignificant compared to the standard deviation of the 
Gaussian filter applied would be removed. The Gaussians are generated 
using the following formula: 


, a PX | fey 2 
G(z, y,0) = Sno?’ (a? +-y")/20 


Then the image, represented as an array of digits, is convolved with the 
Gaussian. 


L(x, ¥; a) = G(x, Y, a) * T(x, y) 


L(x,y,sigma) is the value of the resulting image at location (x,y) under the 
Gaussian filter with standard deviation sigma. I stands for the original 
image. 


We applied Gaussians with scale 0, 1, and 2 to the image. At scale 0, we are 
essentially preserving the original image, at scales 1 and 2 we are 
“smoothing out” the image to an increasing extend. We have 3 octaves of 
resulting images, each octave consists of images resulting from repeated 
applying the gaussian filter of the same scale to the original image. After 
each octave, the image is down-sampled by two. 


Code 


S1st Octave 
Octavel = [); 


ki=@; 
Image(x:x*4,y:y+4)=0; “Szero-pad 
for kj=0:3 


kk=sqrt(2); 
Sigma=(kk*(kj+(2*ki) ))*1.6; 
“generating the gaussians 
for m=-2:2 
for n=-2:2 
gaussian] (m+3,n+3)= (1/{(2*pi)*((kk*sigma)*(kk*sigma) )) )*exp(-( (mem) +(n*n)) /(2*(kk*kk)*(sigma*sigma) )); 
end 
end 
*convolve image with the gaussian filters generated above 
for iml:x 
for jel:y 
singlesum=Image(i:i+4, j:j+4)'.*gaussian1; 
convi(i,j)=sum(sum(singlesum)); 
end 
end 


Octavel=[Octavel convl]; 
end 


Stage 2 


Now we have the image smoothed to different extends, with variant amount 
of fine detailed preserved in the resulting images. Within each octave, we 
use Difference of Gaussian, which is basically subtracting neighboring 
images from each other. Difference of Gaussian is proven to be a close 
approximation of scale-normalized Laplacian of Gaussian, which is shown 
to "produce the most stable image features compared to a range of other 
possible image functions, such as the gradient, Hessian, or Harris corner 
function”. Moreover, Difference of Gaussian is efficient to compute since 
it’s just subtracting images. 


Code 


s*Difference of Gaussian 

% differnce of gaussian for octavel 

diff_11=*0ctavei(1:512,1:512)-Octavel(1:512,513:1024); 

diff_12=0ctavei(1:512,513:1024)-Octavel(1:512,1025:1536); 

diff_13=0ctavei(1:512, 1025: 1536) -Octavel(1:512, 1537: 2048); 
% difference of gaussian for octave2 

diff_21=0ctave2(1:256,1:256)-Octave2(1:256,257:512); 

dif f_22=O0ctave2(1:256, 257:512)-Octave2(1:256,513: 768); 
diff_23=O0ctave2(1: 256,513: 768) -Octave2(1:256, 769: 1024); 

% difference of gaussian for octave3 

dif f_31=Octave3(1:128,1:128)-Octave3(1:128,129: 256); 

dif f_32=Octave3(1:128,129:256)-Octave3(1:128,257:384); 
dif f_33=Octave3(1:128,257:384)-Octave3(1:128,385:512); 


Then for each pixel in a resulting image, we compare it to its eight 
neighboring pixels in the same image and nine neighboring pixels in the 
images processed by adjacent scales. It’s selected if it’s greater or smaller 
than all its neighbors. The result is a candidate key point. 


Code 


%for maximum 
%compare the pixel with it's neighbors in the same image 
if (((diff_12(i, j)>diff_12(i-1,j))&&(diff_12(i, j)>diff_12(i+1,j)).... 
6&&(diff_12(1i,j)>diff_12(i, j-1) )S&(diff_12(i,j)>diff_12(i+1,j+1)).... 
&&(diff_12(i, j)>diff_12(i-1, j-1))&&(diff_12(i,j)>diff_12(i-1,j+1)).... 
&&(diff_12(i, j)>diff_12(iel, j-1))&&(diff_12(i, j )>diff_12(i,j+1)))) 
xl=xi+l; 
else 
continue; 
end 
if xl>@ 
*%compare the pixel with its neighbors in the image processed by gaussian of 
%1 scale more 
if((diff_12(i,j)>diff_13(1i,j))&&(diff_12(i,j)>diff_13(i-1,j)).... 
&&(diff_12(i,j)>diff_13(i+1,j))&&(diff_12(i, j })>diff_13(i,j-1)).... 
&&(diff_12(i,j)>diff_13(i+1, j+1) )&&(diff_12(i, j)>diff_13(i-1,j-1)).... 
&&(diff_12(i, j )>diff_13(i-1, j+1) )&6(diff_12(i, j )>diff_13(i+1, j-1) )&&(diff_12(i, j )>diff_13(i,j+1))) 
yleyi+l; 
else 
continue; 


end 
end 
‘Scompare the pixel with its neighbors in the image processed by gaussian of 
%1 scale less 
if yl>e 
if (({diff_12(i,j)>diff_11(i,j))&&(diff_12(i,j)>diff_11({i-1,j)).... 
&&(diff_12(i,j )>diff_11(i+1,j))&&(diff_12(i,j)>diff_11(i,j-1)).... 
G&&(diff_12(i, j)>diff_11(i+1, j+1) )&&(diff_12(i, j )>diff_11(i-1,j-1)).... 
&&(diff_12(i, j )>diff_11(i-1, j+1) )&&(diff_12(i, j )>diff_11(i+1, j-1) )&&(diff_12(i, j )>diff_11(i,j+1))) 
zlez1+¢1; 
else 
continue; 
end 
end 


Stage 3 


To calculate the magnitude and orientation of each key point, we look at all 
it’s neighboring pixels in the image that is processed with the same scale. 


m(x,y) = (L(a +1,y) — L(w — 1,y))? + (L(a,y + 1) — Lz, y — 1)? 


O(x,y) = tan~'((L(2,y + 1) — L(x,y — 1))/(L(x +1, y) — L(a —1,y))) 


m(x,y) stands for the gradient magnitude of the point and theta(x,y) stands 
for the orientation of the point. 


Code 


*Calculating Magnitude and Orientation 
for i=2:511 
for j=2:511 
‘calculating magnitude and orientation of each point by looking at 
*%its neighbors 
magsquared(i,j)=((diff_12(i+1, j)-diff_12(i-1,j))°2)+((diff_12(i, j+1)-diff_12(i, j-1))*2); 
orientation(i,j)=atan2( (diff_12(i,j+1)-diff_11(i,j-1)), (dif f_12(i+1,j)-diff_11(i-1,j))); 
end 
end 
mag=sqrt(magsquared); 


Implementation of Hough Transform 


Introduction 


The Hough Transform is a feature extraction technique that estimates 
parameters of a shape from its boundary points. Advantages of the Hough 
Transform include that it is scale-invariant, shift-invariant and rotation- 
invariant. It also functions well with added noise. Disadvantages include 
that it is computationally expensive and that it can falsely detect multiple 
instances of a single edge. 


Approach 


The Hough Transform converts points in an edge picture to sinusoids in a 
parameter space expressed in polar coordinates. 


These sinusoids are put into an accumulator. When two points are collinear, 
the sinusoids corresponding to the two points intersect at a point. These 
intersections form peaks in the Hough domain. The peaks in the 
accumulator tell us which features in an image are present. For example, a 
square has four Hough peaks, corresponding to each corner in a square. 


Features that the Hough Transform can detect include lines, points, and 
curves such as circles and ellipses in images because these features have 
parametric representations. 


Implementation 


1. Convert Image to binary scale 


a.The first step we need to do is to convert color images to binary images 
such that the only region of interest are the boundaries of shapes and not the 
varying colors the shape may have. 


Template = imread('Template.jpa'); 
Image = imread('TestiImage2.pnaq'); 


bwlemplate = im2bw (Template); 
bwimage = im2bw(Image,90.1); 


2. Given the binary image, we use the Matlab edge function to detect 
the edges 


% [qTemplate ,t ] = edge (bwlemplate, 'Canny'); 
% [gImage ,t ] = edge(bwimage, 'Canny'); 


3. From the edges we compute the Hough Transform 


{H,T,R] = hough (BW, 'RhoResolution',0O.5, 'Theta',-90:0.5:89.5); 


4. Finally, we calculate the Hough peaks 


% detect Hough peaks 
numpeaks = 10; tNumber of peaks to look for 
P = houghpeaks(H, numpeaks); 


Implementation of Moment of Inertia 


Introduction 


To achieve scale, translation and rotation invariance in object recognition, 
we need to identify intrinsic traits of the object that are not affected by these 
operations. Moment of Inertia is mass property of a rigid body that 
determines the torque needed for a desired angular acceleration about an 
axis of rotation. Moment of inertia depends on the shape of the body and 
thus can be used as parameter for simple object recognition. 


Approach 


Moment of Inertia can be calculated if given centroid of the object and it’s 
mass. 
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I is Moment of Inertia, m is the unit mass at a location of distance r from 
the centroid, and r is the distance from the centroid. 


However a two-dimensional image does not have the mass to be used for 
calculation of Moment of Inertia. We can instead use the intensity of each 
pixel as mass of a particle located at the pixel. With that, we can calculate 
the Moment of Inertia of our images. Since we are only concerned with the 
shape of the object, we can convert the images to black and white (with 
intensity of each pixel being either 0 or 1) and consequently the intensity of 
each pixel can only take on 2 values and the Moment of Inertia can be much 
more easily calculated given this simple duality. 


Actual Formula used- 
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Diagram of Process Used to Calculate Moment of Inertia- 
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Code 
Main Functions Used in Matlab: 


e im2bw- change image to black and white 


¢ bwboundaries- getting boundary indices of object 
¢ regionprops- getting centroids 
[-])function nMomentOfInertia=MomentofInertia (binaryimage, centroid) 
length=size (binaryimage, 1); 


width=size (binaryimage, 2) :| 
MomentOfInertia=0;%MOI 


L-] for i=1:length 
J for j=1:width 
if binaryimage (i,j) 
MomentOfInertia=( (i-centroid (1) )*2+(j-centroid(2) )*2)+MomentOfInertia; 


end 
end 
end 


NumberOfPixels=sum(sum(binaryimage) ); 
nMomentOfInertia=MomentOfInertia/ (NumberOfPixels.%2); 
tnormalized I with total number of pixels 


- return 


[-|function maskedimage=masking (binaryimage, NumberofObjects) 
boundaries = bwboundaries (binaryimage); tboundary indices 
length=size (binaryimage,i);%size 

width=size (binaryimage, 2); 


(-|for k=1:NumberofObjects 
maskedimage{k}=binaryimage; 
=] for i=1:length 
E) for j=1:width 
in = inpolygon(i,j,boundaries{k}(:,1),boundaries{k} (:,2))- 
tdetects if a point of interest is within the boundary 
then mask all points within boundary to be 1 and everywhere 
telse to be 0. 
if (in) 
maskedimage{k} (i,j)=1; 
else 
maskedimage{k} (i,3)=0-; 


end 


Results 


Scale-Invariant Feature Results 


SIFT Results on image of Rice 
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The curves are well detected but horizontal and vertical lines are lost. 


SIFT Results on BIBO Bear 


Original mage to convertedte Gray 


Leyponts locabons 


The words BIBO can be seen on the bear's chest. 


SIFT results on President Leebron 


Original lmege Loconverledto Gray keypomnts locations 


It is evident that it is the image of a person. 


Although SIFT is extremely useful, it cannot easily find vertical or 
horizontal lines because they are not considered feature points. Since our 
image templates were made up of almost all straight lines, we had a 
significant issue. Due to time constraints we chose to implement the edge 
functions in MatLab. Instead of passing the results of SIFT into the Hough 
Transform and Moment of Inertia we passed the results of the edge 
functions in MatLab. 


Results of Matlab Functions 


Results of Passing a Test Image into Matlab Edge Functions 


Example 
test image 


Edges obtained using edge function Edges obtained using bwboundaries 
in Matlab function in Matlab 


Clearly the edge results from matlab are much more suited to our type of 
test images. 


Hough Transform Results 


Hough Transform of Square Template with Hough Peaks in 
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The Hough Transform of the template image. 


Hough Transform of Test Image 1 with Hough Peaks in Blue 


Hough Transform of Test Image 1 (0.1% Salt and Pepper Noise) with Hough Peaks in Blue 


Even with noise same peaks are found. 


Hough Transform of Test Image 2 with Hough Peaks in Blue 
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Hough Transform of image 


Hough Transform of Test Image 2 (0.1% Salt and Pepper Noise) with Hough Peaks in Blue 
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Same peaks are found with noise. 


Hough Transform of Test Image 3 with Hough Peaks in Blue 
[rams 
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Hough Transform of Test Image 3 (Gaussian Noise mean 0 variance 0.0001) with Hough 
Peaks in Blue 
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Hough Transform Noisy image 
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With this amount of noise the peaks are still found. As long as the noise 
does not interfere with the edges the peaks will always be found. However, 
if the edges are lost then the Hough Peaks will be incorrect. 


Moment of Inertia Results 


Description of Steps in Demo 


1. Read in the image as a matrix of type ‘double’ and turn it into a logical 
array (im2bw function) 

2. Using bwboundaries functions, find all connected boundaries (default 
is 4-connectedness of pixel) 

3. Iterate through all closed boundaries and mask all pixels within 
boundary to be 1 and 0 outside the boundary 

4. Find the moment of inertia of the masked images which is normalized 
by the square total number of pixels 


A demo to show how Moment of Inertia works 


Moment Of Inertia = 1.666646e-01 Moment Of Inertia = 1.665643e-01 
Total Pixel Mass = 78961 Total Pixel Mass = 70225 


moo 


There are two boundaries per shape because the outline of the image is a black line of a certain width. Although the 
two masked shapes resulting from this duplication have different Total Pixel Mass, they have almost identical 
normalized Moment Of Inertia. Hence in the implementation of this algorithm, the total number of objects recognized 
is halved to eliminate duplicates. 


Moment of Inertia Method on Simple Image 


Test Image Jb Template Image 


Number of Squares= 3 


100 200 300 400 500 600 700 


Moment of Inertia method on a simple image without noise. 3 
squares are correctly identified. 


Moment of Inertia Method works perfectly on a test image without overlaps 
where the shapes are significantly different. 


However, bwboundaries returns closed boundaries which would create 
problems when you have intersection of lines. 


Moment of Inertia Method on Image with Overlaps 


Moment of Inertia Method for Image with Overlapping Images 


Moment Of Inertia = 1.945959e-01 Moment Of Inertia = 1.895098e-01 
Total Pixel Mass = 56989 Total Pixel Mass = 22674 


Moment Of Inertia = 1.665494e-01 Moment Of Inertia = 1,912433e-01 
Total Pixel Mass = 4830 Total Pixel Mass = 21675 


ml 


We can clearly see from the previous images that the function now detects 4 
objects instead of the three that we are interested in, shown below as the 


two larger squares traced out by red outline and the smaller one filled by the 
color green. 


What the MOI should have 
obtained 


Moment of Inertia Method on Image with Similar Shapes 


Moment of Inertia Method on Similar Shapes 
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In the above image, objects in the top right and top left corners of the 
subplots differ in moment of inertia by 0.703%, in spite of having 
drastically different shapes. However, we note that such algorithm would 
only produce false positive. 


Conclusion and Future Work 


Conclusion 


Scale-Invariant Feature Transform 


After performing SIFT on a number of images and finding the location of 
the key points on those images, we realized that although SIFT recognizes 
the sharp corners and curvatures very well, it would not recognize key point 
along straight lines. Looking back at the SIFT algorithm, we found out that 
this is because straight lines are not scale and rotational invariant as 
compared to corners. Since SIFT focuses on the robustness of key points 
and on the features of an image, straight lines do not qualify as features. 


Since, our goal is to identify squares in an image mainly comprised of 
geometric figures we are mostly dealing with straight lines. Hence, SIFT 
would not be a suitable algorithm for us to locate the edges in the image. 
We would need to resort to other methods. Due to time constraints we 
turned to Matlab edge Functions. 


Hough Transform 


The Hough Transform correctly identified the squares in an image but was 
computationally expensive. Every single pixel in the image had to be 
looked at. Also if enough noise was added that an edge was lost the Hough 
transform ceased to work at all. Overall the Hough Transform is a more 
robust approach because it can be generalized to even more complex 
situations where the Moment of Inertia method cannot. 


Moment of Inertia 


The Moment of Inertia method proved to be extremely rudimentary. It 
worked on simple images where the shapes had very different Moments of 
Inertia and the shapes did not overlap. But if the shapes overlapped the 


boundaries were lost and the shapes were not identified correctly. Also, if 
two shapes had similar Moments of Inertia the system would give a false 
positive. Obviously Moment of Inertia is not a method that can stand on its 
own. However, it only gives false positives. It won't accidentally mask out 
an actual result. Due to that and the fact that it is computationally very 
cheap, it could be used as a preprocessor for a more computationally 
expensive algorithm. 


Future Works 


Continuing on from here, we want to optimize our implementations. After 
successfully improving the speed of our code we would want to solve the 
issue our Moment of Inertia method has with overlapping images. This 
could be solved by starting from the smallest shape and moving outward. 
Finally, we would reorganize our system to match the following design: 
New Design Model 


Moment of Hough 


Edge Function 
Inertia Transform 


in Matlab 


A better system design. Use Matlab edge detection, preprocess with 
Moment of Inertia, and then pass through Hough Transform 


Implementing this design would save computation time in the Hough 
transform. The Moment of Inertia would mask out shapes that were clearly 
not correct which would reduce the number of Hough Peaks and reduce the 
number of calculations. After implementing this improved system we 
would want to create our own way of obtaining the edges of our images 
besides Matlab. 
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Our Approach to Object Recognition 


Our goal is to design two systems 
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compare the two systems, 
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Implementation 
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Analysis and Conclusion 

Scale-Invariam Feature Transform 

+ Missed some points of interest and did not filter 
out more unstable feature points 


Hough Transform 

* Able to locate correct peaks in noisy image 

+ Almost all shapes can correctly be discerned 
with Hough Peaks 


Moment of Inertia 

+ Excellent preprocessing tool 

+ Ifmaments of inertia are nat unique enough 
(shapes are similar) then the shape cannot be 
distinguished 


* Ifedges cannot be properly detected then 
neither system will wark (whether from noise 
or bad approach) 

+ Overall, Hough Transform is a better system 


Introduction 


Optical character recognition, or OCR for short, is applicable to a variety of 
fields and problems. Take the example of recognizing the license plates of 
cars running red lights caught by speeding cameras. Using OCR, we might 
determine the characters on the license plate of the speeding car. 


Initially, we were planning on doing a project along the lines of license 
plate recognition. By developing an application that could recognize the 
characters on a license plate, we would be able to automate the billing 
process for cars caught by speeding cameras, for example. However, as we 
searched the Internet for tutorials on how to perform optical character 
recognition (OCR), we had difficulty finding a fully comprehensive one 
that taught us how to train an algorithm on a custom dataset and then 
perform character recognition on similar images. We did find MNIST, a 
database of handwritten digits, but it does not include letters, nor does it 
pertain to the artificial text on license plates. As a result, we decided we 
would make our own tutorial. 


Our main goal was to create a general step-by-step, easy-to-use tutorial to 
optical character recognition by performing basic techniques available in 
Python’s OpenCV library on a simple data set, to be uploaded to 
Connexions as a contribution to the open education movement. 


As a result, we decided that we would perform OCR on typed-text input in 
png and jpeg files as our valid input, with the letters of the alphabet (both 
capital and lowercase) and digits 0-9 as our classes of characters. 


Methodology and Design Choices 


We decided to create our tutorial in Python using the open source computer 
vision library, OpenCV, for its ease of use and minimal barrier to entry — 
Python is free, where Matlab is not, and Python is easier to learn than C++ 
for beginners. OpenCV, being open-source, is well documented in its 
functionality, though it lacks the kind of comprehensive tutorial we have 
developed. 


When determining how to implement our tutorial, we decided we would 
keep the methods and training data simple. Our input consisted of clean 
images of typed text to teach the user how OCR can work for typed 
characters — the letters of the alphabet and numbers in this case. We decided 
that it would be simplest if we used images and characters with the correct 
orientation as well. 


The feature vector we used was a vector of the average intensity values of 
each character after resizing it to a standard size because this was simple to 
implement in OpenCV. 


For machine learning models, we went with the k-nearest neighbors (KNN) 
algorithm (with k=1) because KNN is easier to implement and explain than 
support vector machines. We settled on k=1, or finding the single nearest 
neighbor, because we found that it produced optimal results, probably 
because of the relatively small size of our training data set; we only trained 
on 15 fonts. Also, since we would ideally be training on a variety of fonts, 
the boundaries between clusters of characters of each class might not be 
quite as easily defined, especially in the case of similar characters like O 
and 0, which would lend to KNN being more effective than SVM. 


Basics of OCR 


For OCR, we need to assume an image has certain textual characteristics. 
For example, there is no point in using a picture of a tree as input to text 
recognition software. 


After reading in an image, the first step to OCR involves preprocessing. 
Any picture the software reads in can be represented by a matrix of red- 
green-blue(RGB) values. Rather than deal with three variables, we can 
make this more computation friendly by converting from RGB to grayscale; 
instead of a matrix with three separate values in each cell, we now have a 
matrix of intensity values between 0 and 255. 


Another issue is that the images dealt with in OCR are not necessarily 
properly oriented; they may be skewed, angled, or flipped. This incorrect 
orientation would make it more difficult for us to correctly classify 
characters. In order to remedy this, a typical preprocessing requires us to 
transform and translate the pixels within the image in an attempt to realign 
the text. 


Depending on our inputs and assumptions, there are multiple options on 
what we will use as a filter. For our purposes, we will be looking at a few 
filters meant for edge detection, specifically Gaussian, Laplacian, and 
Sobel. 


After filtering, we will be utilizing OpenCV's wide array of functions to 
detect our characters and identify them. The API provided will handle tasks 
such as defining the threshold of an edge and the actual edge detection. For 
classifying the characters, OpenCV has a machine learning algorithm, K- 
Nearest Neighbors that we capitalize on. 


Tutorial 


Preparation 


All files referenced and other helpful information is posted on this site at 
the GitHub link. The GitHub link also has 3 Python files, which is our 
completed OCR software. This tutorial serves as a walkthrough of the code 
under recognitionScript.py, where more in-depth explanations are discussed 
in the provided .py files. 


This tutorial uses Python 2.7, Numpy, and OpenCV for the software 
development portions. The instructions for downloading OpenCV for Linux 
can be found at this link. For windows, go to to opencv.org For Numpy, 
visit numpy.org For Python, visit python.org 


We assume the reader will have been exposed to Python before. If not, a 
general understanding of Python IO and the ability to look up Python 
documentation is preferable. 


Preprocessing 
For our tutorial, we will make a few assumptions for our input images. First 


off, the image will only have printed text. We also assume the image will be 
a png. For example, we will have pictures similar to the figure below. 


Y) 
PASDFGHIJIKLZXC 


tyuiopasdfghjktltzxc 


Because this is an introduction, there will not be information on 
transforming and translating, as this involves verbose algorithms beyond 
the scope of our knowledge. 


Second, images may contain the full range of colors. 


The first step is to read in the image we will be performing OCR on. 
OpenCV has the ability to do this with imread() function. We have made a 
wrapper function that does this and returns a copy, which will be necessary 
for some OpenCV functions 


im = cv2.imread(fileName) 
imcopy = im.copy() 
» fileName 


Right now, im is an image, which can be represented as a matrix with 
pixels. Each pixel is some object with 3 values, red, blue, green(rgb). In 
order to make this compatible with future functions, we will convert the 
image to grayscale. This involves a function provided by OpenCV, which 
process the rgb values and replace the pixel with some intensity value. The 
next step is to convert the image into grayscale. Open Cv has a color 
converting function called Cv2.cvtColor(image, code ) which can map one 
type of color to another based on the input code. For the purposes of 
converting our image into grayscale we use the parameter 
Cv2.COLOR_BGR2GRAY and call our the helper function color2gray, 
which returns the grayscaled image. 


( ): 
gray cv2.cvtColor(image,cv2.COLOR_BGR2GRAY) 
gray 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 
0123456789 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 


0123456789 


Above is our grayscaled image. 


Filtering 


Next, we will need to apply a filter onto the image. Depending on the task, 
we could use many different filters to achieve a certain goal. We want to 
filter our image to increase our ability to detect edges; this allows us to 
identify characters in the image. Therefore, we will be covering the 
Gaussian, Laplacian, and Sobel filters. 


Gaussian Filter 
Using the following formula, we will create some n by n matrix to build the 
Gaussian filter. 
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The Gaussian filter is effective towards noisy signals, because of its 
characteristic as also being a low pass filter. By blurring the image to a 
small degree, we allow any sensitive edge detection algorithms to not 
mistake noise to be something of significance. 


Laplace 

In cases where we know that noise is not an issue, we can use the Laplacian 
filer. Rather than worry about reducing noise, the Laplace filter's purpose is 
to enhance edges by sharpening the image. The formula to describe the 
filter is 


at " at 
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Because this filter serves to enhance the features of an image, this also 
increases noise. 


La, y) = 


Sobel 

Sobel filtering involves using two 3 by 3 matrices to convolve with the 
image in order to find the gradient of the image. Although this may be an 
inaccurate approximation, it proves effective for our needs. 
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After testing with multiple files, we had come to the Gaussian filter was 
best suited for the current data sets. There is a convenient OpenCV function 
that does this, Cv2.gaussianBlur( .. ). With appropriate parameters: the 
width and height of the matrix that will be used to filter the image and the 
standard deviation of the Gaussian in the x and y directions (the greater 
standard deviation the less variance among the pixels after filter, i.e. greater 
blur) 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 
0123456789 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 


0123456789 


Above is our blurred image. 


Feature Detection 


The next step is to determine where the edges in the image exist. After 
filtering, there needs to be some way to take each character and define its 
features in some measurable value. Under OpenCV, we used the 
adaptiveThreshold function to take our image and decide whether a certain 
intensity value in the image should be a 0 or 1. Effectively, our processed 
image is currently a matrix of binary values. We have the provided helper 
function. 


In deciding whether a pixel meets the threshold there are two methods: we 
can use an adaptive mean filter or a Gaussian. We found a Gaussian was 
better. The threshold type should be cv2. THRESH_BINARY_INV which 
turns inverts the values-pixels deemed white become black and vice versa. 
This is because Open Cv’s edge finding functions find white characters in 
black backgrounds. 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 
0123456789 


ABCDEFGHIJKLMNOPQRSTUVWXYZ 


abcdefghijklmnopqrstuvwxyz 


0123456789 


Above shows the adaptive threshold of the image. 


From here, we can use OpenCV's findContours method to find the edges. 
This will return coordinates, width, and height of a rectangle around a 
character. Given specific properties, some rectangles may be removed, 
resized, or combined to accomodate special cases. 


Afterwards, we call the following two helper functions, 
findContoursAreas() and removeOverlaps(). findContoursAreas removes 
countours in the list contourlist that do not meet a specified minimum 
height. We use this to ignore small contours, like the tittles around ‘i’s and 
“| 5. 


removeOverlaps goes through the rectangles list and returns a list of 
contours(rectangles) that do not overlap, returning the largest rectangle if 
overlap does occur. This is necessary because the list findContourAreas 
returns all countours in a image, even those that do define an actual 
character.For example contours that make up part of a letter like the “o” in p 
would be included in addition to the contour that we want that encloses the 
“p.” We then sort the list to our liking so we can read the remaining letter 
outlines left to right, top to bottom. We do this with trainingHelper.xsort( 
countour_list ). It uses a simple algorithm to sort the rectangles into sorted 
rows and then into columns. 


ABODERGHIJROMNOBQRSDOUMMRMZ 
abodefghijklmnopgostoumaya 
O128466789 
ABODBRGHIJKLUMNOPQRSDUMMWIRNA 


abodefghijkimoopgostoumayz 


Above shows the code discovering the region of interests. 


Taking each rectangle, we will create a vector to hold the features each 
image has. First, we use a resize method to convert each rectangle into an 
by n matrix of values. The values is determined by dividing the image to n 
by n cells and returning the average intensity of each cell. From here, we 
turn the n x n matrix into a row vector. We can then add other features to 
this vector in order to determine the character. For example, take the 
average of the top half and the bottom half of the image. 


roismall = cv2.resize(roi, (size,size)) 


roismall = roismall.reshape((1,size*size + num_Extra_ Vals)) 


roismall = np.float32(roismall) 


retval, results, neigh resp, dists = model.find_nearest(roismall, k = 1) 


string = ((results[@][@]))) 


Classification 


To classify an image, there needs to be a referenced database to compare the 
characters to. Our provided code solves this through the training method, 
which uses machine learning. Using multiple sets of pictures, we take an 
image, run it through all of the steps above and tell the program what the 
character should be. When finished, we have training data, which holds a 
histogram of vectors mapped to specific characters. 


train(samplesFile, responsesFile): 
samples = np. loadtxt(samplesFile,np.float32) 


responses = np. loadtxt(responsesFile, np. float32)| 


response responses.reshape((responses.size,1)) 
model = cv2.KNearest() 
model.train(samples,responses) 

return model 


Back to the current image, we take its feature vector and use a k-nearest 
neighbor algorithm to determine which vector is most similar to the one 
being compared. From there, we see what the vector is mapped to and 
return what should be the actual letter or number. 
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And there you have it! Thank you for reading through the introduction of 
OCR. Feel free to look through the files, as there will be more resources to 
explain this topic. 


Results & Conclusion 


By the end of the project, we had developed an introductory tutorial to the 
processes and algorithms involved in OCR. 


With the methods we implemented, we were able to achieve little to no 
error when testing on our training data, and when testing on fonts not 
included in our training data set but in one of the font families included in 
the training set, we were able to achieve about 85% accuracy. 
https://cnx.org/content/m52098/ 


Future Work 


In the future we hope to expand our tutorial to take into account the 
multitude of techniques that can be applied during each step of the OCR 
project, ranging from filtering and edge detection to feature extraction and 
classification. 


More specifically, we would implement support vector machines and 
logistic regression as two different learning models in addition to k-nearest 
neighbors, and show users how the results compare using each of these 
models. Pertaining to the data we handle, we would like to incorporate 
handwritten characters into our training data and to introduce noise into our 
images of typed text (by printing them and then scanning the printed 
images). We would also like to create a confusion matrix for the multiple 
implementations of OCR we would like to develop to assess their 
performance. 


And finally, we look forward to taking ELEC 345: Computer Vision to 
expand our knowledge of the topic. 
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Introduction 
An introduction to the D/A conversion design and process. 


What are D/A and A/D Converters 


Digital to Analog (D/A) and Analog to Digital (A/D) converters are 
pervasive and essential in the technology of today. Televisions, smart 
phones, radars, and even sensors in your car all require quality conversion 
in order to work effectively. The improvements and characteristics of these 
converters have significantly supported the digital revolution in past 
decades. D/A and A/D converters are intrinsically inversely related. Digital 
to analog converters are systems that take discrete digital data and convert it 
into continuous analog signals, whereas analog to digital converters achieve 
the opposite effect. Specifically, we will be investigating and implementing 
a D/A converter that takes stored digital data and reproduces the original 
audio signal. An important concept to note is that digital signals can 
undergo manipulation and storage without damaging or losing the original 
data because of their discrete nature. For our project, we chose an R/2R 
implementation of a D/A converter using a BeagleBone Black. We will 
further explore these new concepts later in this educational document, but 
here is our poster which gives a surface-level explanation on most of the 
topics. 
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Characteristics of D/A Converter 


There are multiples D/A architecture designs that one can choose to 
implement. Depending on which architecture one selects, the characteristics 
of the digital to analog conversion are affected. Some characteristics to keep 
in mind when designing a digital to analog converter are speed, accuracy, 
cost, resolution, power consumption, and size. The speed characterizes the 
frequency of the sample production of the audio signal. Accuracy 
determines how true the produced signal is compared to the original. Cost is 
determined by the price of the components used in making the converter 
and can also impacted by the power consumption if measured over time. 
The resolution is the number of voltage output levels which is based on how 
many bits the converter uses. We chose to have 8 bits in our converter so we 
had 248bits = 256 voltage levels. There also exists a reciprocity between 
speed and resolution because the more bits it uses, the more difficult it is to 
reproduce them at a high frequency. The power consumption of the circuit 


depends on how high these other characteristics are and whether one chose 
to implement a passive (low consumption) or active (high consumption) 
system. Active systems include components that control flow and inject 
power in a circuit, such as transistors, while passive systems have 
components that cannot amplify the signal, such as resistors. Finally, the 
size of a D/A converter relies on the scale and number of components used. 


The D/A Process 


First, we will discuss the conversion process in terms of systems and 
signals, and then provide a real world example so that one can see these 
concepts realized. 


{—_}_—_}___1_} 
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Take note that the graphs on the left represent signals in the frequency 
domain while graphs on the right are signals in the time domain. Typically, 
when reproducing an analog signal from digital values, one starts with 
stored discrete data that looks like (1) in the frequency domain when 
produced into a quantized signal. Looking at the time domain in (1), you 
can see the quantized nature of the signal. Note that in the example, the 
original digital signal has nonzero amplitude from -20kHz to 20kHz. This 
range is because our D/A converter has the purpose of reproducing audio 
and the human ear can only hear up to the 20kHz range. Anything above 
this range we cannot hear. However, as you can see there are copies of this 
original digital signal at higher frequencies. We do not want these high 
frequency copies because they consume power when outputted, even 
though we cannot hear them. When experimentally outputting this 


unfiltered digital signal, we also found that they also introduced noise and 
damaged our speaker. So, in order to remove these harmful high frequency 
copies and isolate the original digital signal that occurs at lower 
frequencies, we want to apply a low pass filter as seen in (2). In the time 
domain, we display the impulse response of a simple resistor-capacitor low 
pass filter to help show this cutoff of higher frequencies. Now, in (3), one 
can see that the high frequency copies of the original quantized signal have 
been filtered out. The converter then output this result which is actually an 
exact reproduction of the original analog signal! 


Now, armed with the general D/A conversion process in mind, consider an 
mp3 player producing a song through a speaker. The song audio is 
originally a digital stream of bits that is stored in the player’s memory. This 
digital data can be stored into groups, called packets, or just outputted raw 
into the D/A converter. It takes these bit streams and turns them into a 
digital signal with distinct quantization levels and high frequency copies. 
Then, this digitized signal goes through a low pass filter which effectively 
smoothes out this quantized signal into a continuous analog signal and 
removes the high frequency copies above 20kHz. This signal is a 
reproduction of the original analog signal that was recorded by means of a 
A/D converter! In order to amplify this signal so that we can hear it, the 
analog signal powers an audio amplifier that in turn powers a speaker which 
produces the sounds that we hear. 


Background 
Background of our topic. 


Current D/A Methods 


Remember that there are many uses for D/A conversion, not just audio 
reproduction, so different methods are used for different practical 
applications. One of the most frequently used methods in industry is D/A 
converters that oversample or interpolate. Oversampling is where one 
samples a signal at a much higher frequency than is specified by the 
Nyquist rate. This means that it samples much more often than at twice the 
bandwidth of a signal. This oversampling usually result in effectively 
shifting the noise present in the important low frequencies into high 
frequencies which can then easily be low pass filtered. This results in D/A 
converters with high resolution and low cost. Another commonly used type, 
and the one that most closely reflects what we chose in our project, is a 
binary-weighted D/A converter. This type has more focus on hardware 
implications, rather than the software implications of the oversampling 
mentioned earlier. Binary-weighted D/A converters require components that 
take each of the bits entering the converter to a summing point where they 
add up to an accurate Vout value. The disadvantage to this method is that 
the D/A converter is inaccurate due to the individual tolerances of the 
components. A subcategory of this converter type is an R/2R ladder, that 
has resistor values of R and 2R for each of the input bit streams. This fixes 
the accuracy problem for low resolution converters since it is simple to find 
resistors of these R/2R values. 


What is a BeagleBone Black? 
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The BeagleBone Black is a single-board embedded computer that is made 
with open source software and hardware for developmental purposes. For 
our project, we were provided a BeagleBone Black by the Department of 
Electrical and Computer Engineering at Rice University, but it is much 
cheaper than a full-fledged computer. Its processing power and storage 
Space are much smaller, but if one plugged in an HDMI to a monitor and a 
USB to a keyboard, it can seem like a real computer running Linux. There 
are 46 total I/O pins but many of them are reserved for special uses, so only 
approximately 30 of them are general purpose I/O. With the processor 
speed and I/O count in mind, we chose our design to rely on using 8 pins 
and therefore 8 bits. Also note that the audio files we stored on the 
BeagleBone Black have exactly a 44.1kHz sampling rate. 


Motivation 
A summary of our motivations and choices for this topic. 


D/A Conversion 


We were interested in this project because we wanted to explore the science 
and practicality behind D/A conversion but also to learn from the 
hardware/software interactions afforded by the BeagleBone Black. We 
wanted to take a stored digital signal and take it through the D/A conversion 
process primarily for the purpose of teaching others and learning ourselves. 
This process is enormously important in the world of contemporary 
technology and we are executing a novel process with this specific 
BeagleBone black. Controlling hardware/software interactions is integral in 
the design process for many electronics, so we wanted to incorporate this as 
well. 


Why choose the R/2R Ladder? 


Since we are using 8 bits to quantize our signal, an overall effective D/A 
converter that changes our digital signal to analog is the R/2R ladder. As 
discussed earlier, an R/2R ladder is a binary-weighted converter that uses 
resistors of only two different values: R and 2R (the actual values are 
insignificant, what matters is the 2:1 ratio). These resistors are cascaded 
together in the structure below, allowing for the output voltage to be a 
weighted sum of the input voltages. 
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Although there are other ways to implement a D/A converter, for digital 
signals of 8 bits or less the R/2R ladder is one of the best options. Some of 
the advantages of the R/2R ladder is that it is composed of only resistors of 
two different values, allowing it to be very easily implemented on a small- 
scale at a low cost (small resistors of specific values are can be cheaply 
produced). Some potential drawbacks of using an 2/2R ladder is that with 
longer ladders, the cumulative capacitance of the system could potentially 
delay the transmission of the signal as it is converted from digital to analog. 
However, since we are using 8 bit signals, there are only 8 rungs on our 
ladder, so there is no significant delay time due to the capacitance. Another 
potential problem with R/2R ladders in general is that with the more 
significant bits, the precision of the resistors are increasingly important. A 
small fluctuation in these resistors can completely overwhelm the output 
values of the smaller bits. Fortunately for us, since there are only 8 rungs in 
the ladder, only so much precision is required for our resistors in the most 
significant bit. 


Why choose RC Low Pass Filter 


After a digital signal is converted to analog, its amplitude is still quantized. 
Before it can be outputted as a continuous signal, its amplitude must be 
smoothed out between the different values. In the frequency domain, this is 
effectively low-pass filtering the signal. Although there are many different 
ways to implement a low-pass filter, we decided for our project that the 
most efficient way would be with a simple RC circuit. Like the R/2R ladder, 
a basic RC circuit can be implemented very cheaply and on a small scale, 
since it is composed of only a resistor and a capacitor. Since it is also very 
simple, our signal can propagate through it very fast. There is one potential 
problem with the RC (first-order) low-pass filter is that it attenuates more 
slowly than higher order filters, therefore not completely removing out 
higher frequencies. However, since the human ear cannot hear above 20,000 
Hz, this is not a problem for our implementation. We just require a filter 
good enough to remove the majority of the higher frequencies so as to not 
waste power and potentially damage the speakers, as well as to smooth out 
the different quantized levels in the time domain. A simple RC low-pass 
filter serves that purpose. 


Although there are other ways to implement a D/A converter, for digital 
signals of 8 bits or less the R/2R ladder is one of the best options. Some of 
the advantages of the R/2R ladder is that it is composed of only resistors of 
two different values, allowing it to be very easily implemented on a small- 
scale at a low cost (small resistors of specific values are can be cheaply 
produced). Some potential drawbacks of using an 2/2R ladder is that with 
longer ladders, the cumulative capacitance of the system could potentially 
delay the transmission of the signal as it is converted from digital to analog. 
However, since we are using 8 bit signals, there are only 8 rungs on our 
ladder, so there is no significant delay time due to the capacitance. Another 
potential problem with R/2R ladders in general is that with the more 
significant bits, the precision of the resistors are increasingly important. A 
small fluctuation in these resistors can completely overwhelm the output 
values of the smaller bits. Fortunately for us, since there are only 8 rungs in 
the ladder, only so much precision is required for our resistors in the most 
significant bit. 


Implementation 
Goes over the implementation of our project. 


Overall Design 


This is an image of our actual implementation of this project. As you can see, the BeagleBone Black on the left is 
not attached to any supporting computer. We have the data stored directly on the board, and have 5 volts coming in 
to power it from a power outlet. This power lets the BeagleBone Black perform to the best of its capabilities when 
generating its bit stream. So 8 bits of digital output streams leave the general purpose I/O and enter the R/2R 
ladder. The system takes in these 8 digital outputs and generates a voltage that is the weighted sum of each bit. The 
signal continues through the design, and into the RC low pass filter where the high frequency copies are filtered 
out. Finally, the signal goes through the audio jack, into speakers, where it gets amplified and played as sound. 
Also not that we incorporated input buttons into our design for an educational, live demonstration of the process. 


Full digital Signal 


In order to construct the full digital signal that is outputted by the BeagleBone Black, first we had to store the 
original digital data. We took audio files, turned them into a RAW format where they are unsigned bits so that they 
are easily handled. The BeagleBone Black outputs 8 new bits 44,100 times per second. These 8 output bits are 
changed to a 0 ora 1 based upon which audio file we read. As covered earlier, this yields 256 total voltage levels 
of output. 


The program we wrote outputs the digital realization of several types of functions as well as 8 bit unsigned audio 
files to select pins on the BeagleBone Black. It is controlled by polling 5 input pins connected to tactile switches 
that cycle function, frequency (if applicable) and play/stop. This program relies on the Beagle Bone Black GPIO 
library. 


First in our code, we want to establish pins, functions and songs. 
Definitions 


#define PI 3.14159265 
#define PORT 8 


#define PORT2 9 


//output pins 

#define PINO 19 //least significant 
bit 

#define PIN1 17 

#define PIN2 16 

#define PIN3 15 

#define PIN4 14 

#define PINS 13 

#define PIN6 12 

#define PIN7 11 //most significant 
bit 


//input pins 

#define PIN8 15 
//play/stop button pin 
#define PIN9 12 
//increase frequency pin 
#define PIN10 11 
//decrease frequency pin 
#define PIN11 14 

//next function pin 
#define PIN12 13 

//last function pin 


#define NUMFUNCTIONS 5 //used 
to check if function increment should roll over 


#define SONGO "Beethovend. raw" //define 
the audio file names 

#define SONG1 "TomSawyer.raw" 

#define SONG2 "HarderBetterFaster.raw" 


Now, we want to establish core functions for both the conversion process and the live educational demonstration. 
Functions 


//Function: check_low 
//Checks the state of a button/pin and waits for end of button press 
//Input: 
// pin: the number of the pin to check 
//Returns: 
// 1 if button is pressed/pin is low after waiting for the button to be 
// released/pin to return to high 
// 0 if button is not pressed/pin is high 
int check_low(int pin){ 
if (is_low(PORT2, pin)){ 
while(is_low(PORT2, pin))f{ 
iolib_delay_ms(1); 
} 


return 1; 


} 


else return 0; 


} 


//Function: stop 
//Polls the stop pin/button 


//Returns: 
//1 if button is pressed/pin is low after waiting for the button to be 
// released/pin to return to high 


//0 if button is not pressed/pin is high 
int stop(){ 

return check_low(PIN8); 
} 


//Function: play 
//Updates the 8 output pins with a new value 
//Input: 
// out: the 8 bit value to output 
void play(int out){ 
if (out and 0x1) pin_high(PORT,PINO); 


else pin_low(PORT, PINO); 
if ((out and 0x2) >> 1) pin_high(PORT,PIN1); 
else pin_low(PORT, PIN1); 
if ((out and 0x4) >> 2) pin_high(PORT,PIN2); 
else pin_low(PORT, PIN2); 
if ((out and 0x8) >> 3) pin_high(PORT,PIN3); 
else pin_low(PORT, PIN3); 
if ((out and 0x10) >> 4) pin_high(PORT, PIN4); 
else pin_low(PORT, PIN4); 
if ((out and 0x20) >> 5) pin_high(PORT,PIN5); 
else pin_low(PORT, PIN5); 
if ((out and 0x40) >> 6) pin_high(PORT, PIN6); 
else pin_low(PORT, PIN6); 
if ((out and 0x80) >> 7) pin_high(PORT,PIN7); 
else pin_low(PORT, PIN7); 

} 

//Function: playSine 

//Calculates the next value of sine for the given frequency and calls 


//play 

//Input 

// freq: the frequency of the desired sine wave 
void playSine(int freq){ 


int t = 0; 
int out = 0; 
while(!stop()){ 
out = 128*sin(t++*freq*PI/22050)+128; 
//calculate the next value to output and incrmemnt discrete time 
play(out); 
clock_gettime(CLOCK_REALTIME, and ttime); 
//get the time 
ttime.tv_nsec = 0; 
//clear the nanoseconds but not seconds so that we dont overflow; 
clock_settime(CLOCK_REALTIME, and ttime); 
//set the time back 
clock_gettime(CLOCK_REALTIME, and ttime); 
//get the time again 
ttime.tv_nsec += 21463; 
//increment by number of nanoseconds we should wait 
while(1){ 
//loop to delay next output 
clock_gettime(CLOCK_REALTIME, and curtime); 
if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 


//Function: playTriangle 
//Calculates and outputs the next value of a triangle wave for the given 
//frequency 
//Input: 
// freq: the frequency of the triangle wave 
void playTriangle(int freq){ 
int out = 0; 
int t = 0; 
while(!stop()){ 


out = (256/(11025/freq) )*((11025/freq) - abs(tt++ %(2*11025/freq) - 


(11025/freq))); //calculate next value and increment discrete time 
play(out); 
clock_gettime(CLOCK_REALTIME, and ttime); 
//get the time 
ttime.tv_nsec = 0; 
the nanoseconds but not seconds so that we dont overflow; 
clock_settime(CLOCK_REALTIME, and ttime); 
the time back 
clock_gettime(CLOCK_REALTIME, and ttime); 
the time again 
ttime.tv_nsec += 21463; 
//increment by number of nanoseconds we should wait 
while(1){ 
to delay next output 
clock_gettime(CLOCK_REALTIME, and curtime); 


//clear 
//set 


//get 


// Loop 


if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 


//Function: playSquare 
//Updates output to maximum and minimum values at the given frequency to 
//create a square wave 
//Input: 
// freq: the frequency of the square wave 
void playSquare(int freq){ 
int out = 0; 
int delay; 
while(!stop()){ 
out A= OXFFFF; 
//exclusive or all the bits to make a square wave 
play(out); 
delay = 1000000000/freq/2; 
//calculate how long to delay to match frequency 
clock_gettime(CLOCK_REALTIME, and ttime); 
//get the time 
ttime.tv_nsec = 0; 
the nanoseconds but not $ 
clock_settime(CLOCK_REALTIME, and ttime); 
the time back 
clock_gettime(CLOCK_REALTIME, and ttime); 
the time again 
ttime.tv_nsec += delay; 
//increment by number of nanosec$ 
while(1){ 
to delay next output 
clock_gettime(CLOCK_REALTIME, and curtime); 
if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 


//Function: playSongO 
//Outputs a raw 8 bit unsigned audio file as defined above 
void playSong®@( void) { 

char cwd[1024]; 

if (getcwd(cwd, sizeof(cwd))== NULL) perror("getcwd() error"); 
current working directory 

char* path; 

path = malloc(strlen(cwd) + strlen(SONGO)+2); 
//Append the defined file name to the path 

strcpy(path, cwd); 

char temp[2] = "/"; 

strcat(path, temp); 

strcat(path, SONGO); 


FILE *fp; 
fp = fopen(path, "r"); 
if (fp == 0) perror("error opening file, is the name correct?"); 


int out = fgetc(fp); 
while((out != EOF) and and_ !stop()){ 
//play each of the stored values until end of file or the stop button is pressed 
play(out); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec = 0; 
clock_settime(CLOCK_REALTIME, and ttime); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec += DELAY; 
while(1){ 
clock_gettime(CLOCK_REALTIME, and curtime); 
if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 
out = fgetc(fp); 


//clear 
//set 


//get 


//lLoop 


//Get the 


} 
fclose(fp); 
} 


//Function: playSong1 
//Outputs a raw 8 bit unsigned audio file as defined above 
void playSong1(void) { 
char cwd[1024]; 
if (getcwd(cwd, sizeof(cwd))== NULL) perror("getcwd() error"); 
char* path; 
path = malloc(strlen(cwd) + strlen(SONG1)+2); 
strcpy(path, cwd); 
char temp[2] = "/"; 
strcat(path, temp); 
strcat(path, SONG1); 


FILE *fp; 
fp = fopen(path, "r"); 
if (fp == 0) perror("error opening file, is the name correct?"); 


int out = fgetc(fp); 
while((out != EOF) and and _ !stop()){ 
play(out); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec = 0; 
clock_settime(CLOCK_REALTIME, and ttime); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec += DELAY; 
while(1){ 
clock_gettime(CLOCK_REALTIME, and curtime); 
if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 
out = fgetc(fp); 


} 
fclose(fp); 
} 


//Function: playSong2 
//Outputs a raw 8 bit unsigned audio file as defined above 
void playSong2(void) { 
char cwd[1024]; 
if (getcwd(cwd, sizeof(cwd))== NULL) perror("getcwd() error"); 
char* path; 
path = malloc(strlen(cwd) + strlen(SONG2)+2); 
strcpy(path, cwd); 
char temp[2] = "/"; 
strcat(path, temp); 
strcat(path, SONG2); 


FILE *fp; 
fp = fopen(path, "r"); 
if (fp == 0) perror("error opening file, is the name correct?"); 


int out = fgetc(fp); 
while((out != EOF) and and_ !stop()){ 
play(out); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec = 0; 
clock_settime(CLOCK_REALTIME, and ttime); 
clock_gettime(CLOCK_REALTIME, and ttime); 
ttime.tv_nsec += DELAY; 
while(1){ 
clock_gettime(CLOCK_REALTIME, and curtime); 
if (curtime.tv_nsec > ttime.tv_nsec) 
break; 


} 
out = fgetc(fp); 


} 
fclose(fp); 
} 


Finally, in a portion of the main section of our code, we need to call these established functions in order to update 
pins and respond to button presses. 
Main 


//Function: main 
//Polls buttons for input and calls respective functions 
int main(void) { 
int ret[13]; 
iolib_init(); 
//initialize i/o library 
//intialize the pins to output and input 


ret[Q0] = iolib_setdir(PORT, PINO, DIR_OUT); 
ret[1] = iolib_setdir(PORT, PIN1, DIR_OUT); 
ret[2] = iolib_setdir(PORT, PIN2, DIR_OUT); 
ret[3] = iolib_setdir(PORT, PIN3, DIR_OUT); 
ret[4] = iolib_setdir(PORT, PIN4, DIR_OUT); 
ret[5] = iolib_setdir(PORT, PIN5, DIR_OUT); 
ret[6] = iolib_setdir(PORT, PIN6, DIR_OUT); 
ret[7] = iolib_setdir(PORT, PIN7, DIR_OUT); 


ret[8] = iolib_setdir(PORT2,PIN8, DIR_IN); 
ret[9] = iolib_setdir(PORT2,PIN9, DIR_IN); 
ret[10] = iolib_setdir(PORT2, PIN10, DIR_IN); 
ret[11] = iolib_setdir(PORT2, PIN11, DIR_IN); 
ret[12] = iolib_setdir(PORT2, PIN12, DIR_IN); 
int function = 3; 
int frequency = 500; 
while (1){ 
//poll for user input from buttons 
//check each pin for update 
if (check_low(PIN8) ) { 
printf("PIN8 pressed"); 
if (function == 0){ 
else if (function == 4)f{ 
playSong1(); 


else if (function == 5){ 
playSong2(); 
} 


} 
else if (check_low(PINQ)){ 
//increase frequency by 100 Hz 
printf("PIN9 pressed"); 
frequency += 100; 
if (frequency and gt; 20000){ 
frequency = 20000; 
} 


iolib_free(); 
return (0); 
} 


R/2R Ladder 


Our implementation of the R/2R ladder consists of resistors of 110 and 220Q, cascaded together to form eight 
different rungs, each fed by an output bit from the BeagleBone Black, as shown by the diagram. 
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In this configuration, the output voltage, Vout, can be calculated by the following formula:Vout = Vref * (A/248) 
where Vref is the 3.3V positive voltage of the BeagleBone Black and A is the binary value currently outputted by 
the ai's (0 < A < 255). This is because the lower significant bits are scaled-down by the 110Q resistors between 
them and Vout; each rung between a bit and the output voltage scale the bit by a factor of 1/2 before it reaches the 
output. With this system, we can use change the values of the 8 bits to output any one of 256 different voltages, 
scaled by 3.3V. 


One thing we considered when building the R/2R ladder is the tolerance of the resistors used in the last few rungs. 
Since the value of the most significant bit is hundreds of times more significant than the value of the least 
significant bit, we took extra precautions to select and use only the resistors with very tight tolerances for the last 
few rungs, especially the last rung. If we had not done so and the values of the last two resistors were off by a 
significant percentage, then our output signal would be considerably distorted along the amplitude axis; the highest 
possible output value might actually be higher or lower than expected and vice versa for the lowest possible output 
value. However, during our tests of the system, the oscilloscope showed no significant distortion of the output 
voltages, and our final output signal played as expected. 


RC Low Pass Filter 


When we implemented the RC low-pass filter, we chose the values of the resistor and capacitor to obtain a cutoff 
frequency (at which the gain is 0.707) of approximately 20,000 Hz. We decided to use a capacitor of 550nF and a 
resistor of 16Q, as shown below. 


The magnitude of the transfer function given by this system is: 


1 1 
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Given this transfer function, we can calculate the cutoff frequency to be 18,086 Hz. We could not find resistors and 
capacitors to have a cutoff frequency of exactly 20,000 Hz, but for audible sounds, we decided that 18,000 Hz will 
be enough to serve our purpose. 


Results 
Results of the implementation of our project. 


Full Digital Signal Output 


0 i rr 


The full digital signal output exists in the digital realm, but it is a bit stream 
that gets 8 new bits 44,100 times each second. As discussed earlier, the 
exact nature of this bit stream is based on the original audio file, but there 
are a total of 256 different output voltage levels. 


R/2R Ladder Output 


The full digital signal enters the R/2R Ladder and the voltage at output 
becomes the weighted sum of each bit. The 256 quantization levels make 
the signal appear blocky and really shows that higher frequencies have been 
introduced into the signal. 


The full digital signal enters the R/2R Ladder and the voltage at output 
becomes the weighted sum of each bit. The 256 quantization levels make 
the signal appear blocky and really shows that higher frequencies have been 
introduced into the signal. 


Audio Output 


Kernel Multitasker 


In our final results, we noticed that the final audio output had significant 
random delays. This is due to the kernel multitasker on the BeagleBone 
Black. Since it runs Linux, and Linux is not a realtime system, the 
processor will choose to run a background process that is not our code from 
time to time. We used the BBIOIlib which directly addresses the general 
purpose I/Os which avoid many of the Linux performance issues, but not 
this one. We wanted to collect data on our systems signal to noise ratio, but 
because of this significant and random delay, it would make our data 
significantly off. The final sound is pretty clear when coming through the 
speakers, other than the intermittent delay. However, we could calculate the 
signal to quantization noise ratio (SQNR) which was around 48 dB. 


Conclusions 
Conclusions on the results of our implementation. 


Performance on Characteristics 


Overall, we thought that our implementation of this D/A converter was a 
great success. As far as the characteristics of a D/A converter, we succeeded 
with several of them which in turn causes the others to be impaired. The 
power consumption of our system was low because we filtered out the 
power hungry high frequencies and implemented it completely passively. 
This, in turn, impacts the size of our design. It was small due the small 
number of components. The cost of our system was very low, just a handful 
of simple components and low power consumption. Finally, for speed, a 
44.1kHz reproduction rate is all that one desires when reproducing audio 
signals like we were. Our implementation had the shortcomings of accuracy 
and resolution. The system was inaccurate because of the Linux kernel 
multitasker delaying the signal at random intervals. Our choice of a low bit 
count in conjunction with an R/2R ladder forces the resolution to be low 
and introduces quantization error. 


Future Improvements 


There are future improvements that can be made to make this system even 
better. In order to tackle the Linux kernel multitasker issue, we cannot just 
set the priority of our designated process. Other background processes are 
forced to run, but we have the option to install a real-time kernel. This is a 
fairly substantial task and exists outside the scope of ELEC 301. However, 
the realtime kernel would be more accurate because it would be able to run 
our process exclusively and in realtime. We also could perform up-sampling 
on the fly which is interpolation that approximates what sampling at a 
higher rate would do. We discussed using upsampling in conjunction with 
splines in ELEC 301, but the implementation would be rather difficult. 
Finally, in order improve our resolution, we could increase the number of 
bits that our BeagleBone Black outputs. However, the more bits that we 
use, the lower the frequency that we can transmit them at is because the 
BeagleBone Black has limited processing speed. Also, if we upgrade to a 


16-bit system for instance, we would need to choose a design other than the 
R/2R implementation because as discussed earlier, the capacitance and 
tolerance of the system begin to matter. 
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this module is a link to download our poster for this project in pdf format. 


Introduction 
Description of the motivation for undertaking the task of creating selective 
transparent headphones. 


The advancement of signal processing technologies has consistently 
improved the enjoyment of listening to music. For example, noise 
cancelling techniques have been widely applied to headphones that, in 
addition to physical noise isolation, provide further reduction in noise. 


We believe, however, that it is sometimes important for music listeners with 
a pair of good noise isolating headphone to be aware of certain surrounding 
signals. For example, when people are crossing the street while listening to 
music, they really want to hear car horns to be warned of possible danger. 
We also think that human speech is important, and that people may want to 
hear others talking to them while they are listening to music. 


Based on the assumption above, we propose to build a selective transparent 
headphone that propagates certain surrounding sound signals and 
suppresses all other signals. In this project, we explore possibility to select 
only speech signal to propagate and silence other surrounding sounds. 


The following sections are organized as follows. The second section 
provides an overview of the project, defining the problem and explaining 
high level system implementation. The third to the fifth sections explain in 
detail the blind source separation algorithm and binary artificial neural 
network, respectively. The following sections discuss experiment results, 
and suggest next steps. 


Project Overview 
Overview of the steps followed to create the system. 


Problem Definition 


There are a number of sound signals that are of importance in the 
environment. In this project, we identify human speech as an important 
signal, that we would like to focus on. We would like to separate the 
environment sound signals into a signal containing only human speech and 
a signal containing all other speech. By making the above simplifications, 
the problem is reduced to finding the speech content in the surrounding 
environment, and forwarding the speech content to the listener while 
suppressing other signals. If there are no speech signals, or speech signals 
are weak compared to other signals, all sounds from the environment will 
be attenuated. 


System Implementation 


We approach the problem stated above by first separating the source signals 
into human speech content and non-human speech with a blind source 
separation algorithm. After that, a classification algorithm determines 
which signal contains human speech, if any, and outputs that signal. A 
detailed overview is shown below with the system block diagram in Figure 
i, 
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The audio input, a mixture of human speech and some other noise such as 
instrumental music, is passed to the blind source separation block. Inside 
this block, short time Fourier transform is first performed to change the 
time domain input signal into frequency domain, since all the other 
operations on the signals reside in frequency domain. After that a 
preprocessing filter is applied, which cleans the input signal by removing 
reflection and ambient noises. Then for each frequency, the independent 
component analysis algorithm separates the signal into two parts such that 
independence between the two signals is maximized. Scaling and 


permutation filters minimizes distortions caused by using the independent 
component analysis method. The blind source separation process outputs 
two signals, where at most, one signal contains speech. 


We then implement a binary artificial neural network (ANN) for 
classification. The two identical ANN take the two signals separated from 
the source as inputs respectively and outputs a weight for each signal. We 
train the neural network in the way that a signal containing more human 
speech will output a higher weight. Based on the weights of the separated 
signals, the output selection multiplexer chooses the proper signal to output. 
That is, if one signal, referred to as A, has a higher weight than the other 
signal, referred to as B, and its weight is above a certain threshold (0.5), the 
multiplexer outputs signal A. If both the weights of signal A and signal B 
are below that threshold, then the multiplexer outputs nothing. We set a 
threshold to deal with situation where both signal contain no human speech. 


Background on Independent Component Analysis 

Describes the motivation and background behind using independent 
component analysis as the basis of the blind source separation algorithm 
used in this endeavor. 


The blind source separation algorithm, we employed, is based around 
independent component analysis, which is explained in more detail below. 
The goal is to recover independent sources given only sensor observations 
that are linear mixtures of independent source signals. We assume that the 
source signals are statistically independent and non-Gaussian. 


ICA works by solving the following model. 
T= As 


Where vector x contains n observed signals, vector s contains independent 
sources that comprise x, and matrix A denotes the mixing matrix. It is 
assumed that the mixing matrix A and the source vector x are both 
unknown. ICA algorithm thus produces the best estimation for both the 
mixing matrix and source vector. 


The first step in ICA is usually whitening (sphering) the data, which 
removes any correlations in the data, i.e. the signals are forced to be 
uncorrelated. Putting the words in mathematical terms, we seek a linear 
transformation V such that when y = Vx we now have E[yy’'] = I. 


After sphering, the separated signals can be found by an orthogonal 
transformation of the whitened signals y (this is simply a rotation of the 
joint density). The appropriate rotation is sought by maximizing the non- 
normality of the marginal densities. This is because of the fact that a linear 
mixture of independent random variables is necessarily more Gaussian than 
the original variables. 


Although in theory ICA algorithm can perfectly separate source signals 
from observed signals, provided that sources are statistically independent 
and non-Gaussian, most of the algorithms do not work quite well in real 
world scenarios. For example, ICA does not typically deal with reflection; 
that is, if mixed signal is observed in a small room with reverberation, ICA 
algorithms such as FastICA hardly identifies individual components from 


the observed signal. Therefore, we need a more robust blind source 
separation algorithm, which is discussed in the following section. 


Blind Source Separation 

Describes the method to separate an arbitrary signal into its independent 
components for the application of creating a selective transparent 
headphone. 


Background 


When separating mixtures of instantaneously mixed signals, independent 
component analysis works very well, but this ideal situation is rarely found 
in the real-world. In practical applications, the environment distorts audio 
signals by adding echoes, reflections, and ambient noise. Additionally 
independent component analysis, in its purest form, assumes that source 
signals do not have any propagation delay, which is an assumption that 
cannot be applied in this case. Recording sources from two microphones 
placed in different locations will inevitably introduce propagation delays, so 
the blind source separation method used must also consider this issue. 


To solve problems detailed above, the blind source separation problem will 
be redefined in the time-frequency domain. By taking the short-time Fourier 
transform (STFT) of the audio inputs, we can represent the inputs as the 
following. 


r(w, t) = [X1(w, ¢), X2(w,t), 4 XM(W, t)|* 


Xi(@,t) refers to the input observed at the ith microphone observed at time 
t. The T symbol refers to the transpose, so in this case, the input sources are 
represented along the rows. For our application, we will assume that the 
number of observed inputs and the number of separated sources are both 
equal to the constant M. We can, now, formulate our blind source separation 
problem as the following. 


x(w,t) = A(w)s(w, t) + n(w, t) 


In this case, s(@,t), refers to the vector of source signals observed at 
frequency @ and at time t. The second term, n(@,t), represents any type of 
noise or distortions that may be present in the observed signals (noise, 
reflections, etc.). The blind source separation method attempts to solve for 
the mixing matrix, A(@), which can be represented as the following. 


Ajj (w) = Hi; ; (w)eI¥Ti.3 


Here, Hi,j(@), refers to the transfer function of the jth source to the ith 
microphone. Additionally, ti,j, refers to the propagation delay from the jth 
source to the ith microphone. 


The system, for which our problem is defined on, is now redefined from an 
instantaneous mixture of signals to a convolutional mixture in the time- 
domain of the following form. 


“o(t) = a(t) *s() 


This formulation allows for a solution which considers the distortion, 
n(q,t), added by the environment and considers the inherent propagation 
delay that is inherently present in our application. 


Since the new formulation of the blind source separation problem is a 
convolutional system in the time-domain, we will solve the separation 
problem in the time-frequency domain formulation shown in Figure 2. 
Given our approach outlined above, we can solve for the complete 
frequency-domain separation filter shown below. 


G(w) = Pw)Biw)Bw) 


B(q@) is the separation filter of the system. Applying this filter will separate 
the spectrum of the observed signal into the spectra of the source signals 


observed at frequency, w. As stated above, by reformulating the problem 
into the frequency domain, the convolutional mixture becomes transformed 
into an instantaneous mixture. As such, independent component analysis 
can be applied to separate the source signals at each frequency. However, in 
doing so, we introduce other problems, inherent to the independent 
component analysis method, that the other filters attempt to resolve. 


The following sections will explain the three filters in detail. 


¢ B(q), the separation filter. 

¢ Bm+, the normalization filter used to solve the scaling problem that 
arises after applying independent component analysis. 

e P(q@), the permutation filter used to solve the frequency distortion 
problem that arises after applying independent component analysis. 


Separation Filter 


The separation filter, shown in the previous section can be broken down 
into the following components. 


W(@) is the preprocessing filter used to reduce reflections and ambient 
noise by using the subspace method. U(q) is the filter obtained by applying 
independent component analysis after preprocessing. 


I. Subspace Method 


The spatial correlation, or autocorrelation, matrix at frequency q@ is defined 
below. 


R(w) = Ela(w, t)x” (w, t)] 


Given that there are M inputs and M sources, the resulting matrix will be 
MxM. This matrix can be rewritten in the following way. 


R(w) = AQA? + K 


The spatial 
correlation matrix. 


K = E[n(t)n" (t)] 


The noise 
correlation 
matrix. 


Q = E[s(t)s*(t)] 


The sources' 
cross-spectrum 
matrix. 


In other words, K is the correlation matrix of n(t) and Q is the cross- 
spectrum matrix of the sources. However, since the source signals, s(t), are 
unknown at this point, this equation cannot be directly solved. Instead, we 
can represent the generalized eigenvalue decomposition of the spatial 
correlation matrix in the following manner. 


R= KEAE™! 


E refers is the eigenvector matrix, E = [e1,e2, ...,eM], and A = 
diag(A1,A2,...,AM) refers to the eigenvalues of R. Since K, the noise 
correlation matrix, cannot be directly observed apart from the source 
signals, we will assume that K = I, which will evenly distribute the 
reflections and ambient noise induced by the environment among the 
estimated sources. As such, the eigenvalue decomposition of R can be 
rewritten as the following. 


Raf 
The final preprocessing filter uses the eigevector and eigenvalue matrices 


resulting from the standard eigenvalue decomposition of the spatial 
correlation matrix and is shown below. 
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II. Independent Component Analysis 


By applying the preprocessing filter obtained from the subspace method, 
our observed signals are now of the following form. 


y(w,t) = W(w)e(w,t) 


To estimate the source signals from the preprocessed input signals, we can 
use independent component analysis to solve the linear system displayed 
below, for the filter U(o). 


y(w,t) = U-+(w)8(w, t) 


Now that the separation filter at each frequency has been found, the 
problems that arise from the separation process must be addressed. 


Scaling 


The scaling problem that results from applying independent component 
analysis can be resolved using the filter shown below. 


Bt (w) = diag|By, 1, Bros +> Br al 


Bm,n+(q) refers to the (m,n)th entry in B+(o), which is the pseudo-inverse 
of B(@) (regular inverse also works here since we assume that the number 
of separated sources and the number of inputs are both M, so that the 
resulting separation matrix is square). Additionally, m refers to an arbitrary 
row or microphone in B+(o). By applying this matrix, each signal, i, is 
amplified by the component of signal i observed at microphone or input m. 
Essentially, this filter amplifies the frequency components of the separated 
source signals so that the waveforms of the resulting sources will be 
distinguishable and audible. 


Permutation 


A second critical problem that arises after applying independent component 
analysis is that the order in which the source signals are returned is 
unknown. Since independent component analysis is applied at each 
frequency, we must find the permutation of components that has the highest 
chance of being the correct permutation to reconstruct the source signals 
and minimize the amount of frequency distortion caused by independent 
component analysis. 


We define the permutation matrix, P, as the following. 


Z(w) = [A (w), Z2(w),..., Zr (w)| 


The matrix, P, exchanges the column vectors of Z(o) to get different 
permutations. We define the cosine of 0n between the two vectors zn(@) 
and zn(@0), where w0 is the reference frequency is as follows. 


a, ME bio) 
cos(9n) = Te (oyicz# (wool 


The permutation is, then, determined by the following. 


P = arg maxp F(P) 


The cost function, F(P), used above, is defined as follows. 


This compares the inverse vector filter at each frequency with a filter at a 
reference frequency, w0. However, if we use the same reference frequency 
to find the permutation at every other frequency, then the problem arises if 
the filter at the reference frequency has the incorrect permutation. In order 
to minimize this error, a new reference frequency is chosen after the 
permutations of filters at K frequencies are found. As such, the frequency 
range of a reference frequency, @0, is as follows. 


wo =wW—k- Aw, Vk = 1,...,K 


Background on Artificial Neural Network 
This module talks about the backgrounds on artificial neural network 


Motivation: 


As the Blind Source Separation algorithm randomly assigns the two 
independent components into two output vectors, we need the artificial 
neural network to determine which of the two signals, if any, contains 
human speech. 


Background on Our Chosen Neural Network Model: 
Our Neural Networks Model: 


A simplified schematics of our neural network looks like the following: 


Layer L, Layer L, 


Neural Network Model 


Our artificial neural network consists of one input layer (Layer L1), one 
hidden layer (Layer L2), and one output layer (Layer L3). The circles 
labeled “+1” are called the bias units, and corresponds to the intercept term. 


Our neural network has parameters (W,b) = (Wb, W@),b@), where we 
write wo; to denote the weight associated with the connection between 
unit j in layer 1, and unit i in layer 1+1. Also, b“; is the bias associated with 
unit i in layer 1+1. We also use s; to denote the number of nodes in layer | 
(not counting the bias unit). 


We will write al). to denote the activation of unit i in layer |. For l=1, we 
also use a“); =x; to denote the i-th input. Given a fixed setting of the 
parameters W,b, our neural network defines a hypothesis hw,,(x) that 
outputs a real number. Specifically, the computation that this neural 
network represents is given by: 
2 1 1 1 1 
a? = f(WY 2, + WE ao + WY 23 + 1) 
2 1 1 1 1 
ay? = f(Way) a1 + Way) 2 + Wy3)23 + 03”) 
2 1 1 1 1 
ay” = f(Wsy) a1 + Wyy 2 + Wy3'3 + 65”) 
3 2) .2 2) (2 2) (2 2 
hwa(a) = ay” = f(Wyyay” + Wiz’ a” + Wi3?aS” + bY”) 
Let 2; denote the total weighted sum of inputs to unit i in layer 1, including 
the bias unit. For instance, 
2? = 02, Wea; +P) 


, so that 


i (l 
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If we extend the activation function f(.) to apply to vectors in an element- 
wise fashion, we can write the above equations more compactly as: 


2) = We + 6 

q?) = f( 2)) 

28) — wR,_@ 4. 5°) 
hy-»(2) = q') = y (2°) 


The above equations represents the forward propagation step. More 
generally, as we use a“) =x to denote the values from the input layer, then 
given layer |’s activations a”) , we can compute layer 1+1’s activations a“*) 
as: 


+1) — yrOg 4 AO 
att) — f(z) 


Backpropagation Algorithm: 
Suppose we have a fixed training set 
{(c®, y),. bg (a™ y™)} 


of m training examples, the overall cost function is: 


m nme—-l sy 8 
J(W, 6) = 2 Son b; a) re >> > > (we?) 
i=1 i=1 i=1 j=1 
1 m 1 ' “ii 5 ag St Si41 9 
- Pos (; \lrwe(x) — y®|| ) +5 y > 2 (w}?) 


The first term in the cost function is an average sum-of-squares error term. 
The second term is the regularization term which is used to prevent 


overfitting. The weight decay parameter A controls the relative importance 
of the cost term and the regularization term. 


Our goal is to minimize the cost function J(W,b) as a function of W and b. 
To begin training our neural network, we will initialize each weight Ww; 
and each b©); to a small value near zero. It is important to initialize the 
parameters randomly for the purpose of symmetry breaking. 


In order to perform gradient descent to minimize the cost function, we need 
to compute the derivatives of the cost function with respect to Ww; and b. 
respectively: 


0 lw O -_ ai 
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The intuition behind the backpropagation algorithm is as follows. Given a 
training example (x,y), we will first run a “forward propagation” to 
compute all the activations, including the output value of the hypothesis 
hw,p (x). Then for each node i in layer 1, we compute an error term 5. that 
measures how much that node was responsible for any errors in the output. 
For an output node, we can directly measure the difference between the 
network’s hypothesis and the true value, and use that to define the error 
term for the output layer. 


Here is the implementation of the backpropagation algorithm in 
MatLab: 


1. Perform a feedforward propagation, computing the activations for 
hidden layer L2 and output layer L3. 
2. For the output layer (layer ny ), set 


509) = —(y— al) » f(z) 


1. For the hidden layer, set 


69 = ((W)T5) e f'() 


1. Compute the desired partial derivative of the cost function with respect 
to W and b 


Vwwod (W, 6b; 2, y) = 5D (Q)P, 
Vuod (W, b; @; y) = ft), 


Implementation note: in step 2 and 3 above, we need to compute f’(z"); ) 
for each value of i. as our f(z) is the sigmoid function and we already have 
a‘). computed during the feedforward propagation process. Thus using the 
expression for f’(z), we can compute this as 


f (2) =a? - a?) 


Below is the pseudo code of the gradient descent algorithm: 
Note: AW“ is a matrix of the same dimension as W” and Ab“ is a vector 
of the same dimension as b") . one iteration of the gradient descent as 


follows: 


1. Set AW := 0, Ab := 0 for all 1. 
2. For i from 1 to m, 


a. Use backpropagation to compute 


Vw (W, 6; a, y) and 
Vind (W, b; ©, Yy).- 


a. Set 
AW® := AW® + VywJ(W, b; 2,y). 


b. Set 


Ab® := Ab! + Viwd (W, b; x, y). 


1. Update the parameters: 


WO =W®-a (aw) + wwe 
m 


19 p99 [as 
m 


We can now repeat the gradient descent steps to reduce our cost function 
J(W,b). 


Implementation and Experimental Results 
This is our implementation of neural network and Blind Source Separation 
and their respective results 


Implementation of Blind Source Separation: 


When implementing the Blind Source Separation algorithm, we chose the 
following parameters, 
Parameters 
Sampling Rate 16 kHz 
Length of STFT 912 


Number of FFT 4096 


Points 

Overlap 492 
Number of inputs 2 
Permutation 5 


Reference Range 


Input Parameter Into The System 


Result of our Blind Source Separation: 
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Observed Signal One Sampled at 16kHz 


Mixed Signal Observed at 2"? Microphone 
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Estimated Source Signal One 
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Estimated Source Signal One-human speech 


Estimated Source Signal Two 
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Estimated Source Signal Two-Instrumental Music 


Figure 2 and Figure 3 show the mixed signals observed at the 1% 
microphone and the 2" microphone respectively. The mixed signal is a 


recording of a person counting from zero to ten with instrumental music 
playing in the background. The intermittent spikes in the mixed signal 
represents the person counting while the continuous, small-amplitude signal 
represents the instrumental music. 


After applying the Blind Source Separation algorithm, we obtain two 
independent signals shown in Figure 4 and Figure 5 respectively. These two 
signals are very distinct from each other. The output signal 1 shown in 
Figure 4 consists mainly of the person counting whereas the output signal 2 
shown in Figure 5 is primarily the instrumental music. 


Implementation of Neural Network: 


Since we need our neural network to perform binary classification, we used 
512 nodes for the input layer, 25 nodes for the hidden layer, and 2 nodes for 
the output layer. 


We used 10,000 positive training examples and 10,000 negative training 
examples to train our neural network. The positive training examples 
mainly consist of audio file containing examples of human speech such as 
news broadcasting. The negative training examples mainly consist of 
examples of non-human speech such as instrumental music. 


After taking 512 point Short-Time Fourier Transform, we place our training 
sets into a 20,000-by-512 matrix where each row represents a single 
training example. Then we used the backpropagation and gradient descent 
algorithm described in the section above to train our neural network. We 
used 70% of the training examples for training, 15% of the training 
examples for validation, and another 15% for testing. 


Result of Our Training of Neural Network: 


After employing backpropagation algorithm and gradient descent algorithm 
to train our neural network, we get the following test results: 


Training Confusion Matrix Validation Confusion Matrix 


Output Class 
Output Class 


1 1 


2 2 
Target Class Target Class 


Test Confusion Matrix All Confusion Matrix 


Output Class 
Output Class 


1 1 


2 2 
Target Class Target Class 


Accuracy of Neural Network 


As our neural network is a binary classification model, our output is a 
vector containing two entries. The first entry represents the probability of 
the input signal being a human speech and the second entry represents the 
probability of the input signal being a non-human speech. The two entries 
sum to one. The overall performance of our neural network is shown in the 


All Confusion Matrix. The red block showing 0.2% means 0.2% of the 
class 1 examples (human speech) has been incorrectly identified as class 1 
(non-human speech). Similarly, the red block showing 1.2% mean 1.2% of 
the examples of class 2 (non-human speech) has been incorrectly identified 
as Class 1 (human speech). The blue block shows that our neural network is 
able to distinguish between human speech and non-human speech with an 
accuracy of 98.6%. 


Conclusion and Next Steps 
This is our conclusion 


Conclusion 


From our experimental results, we can see that our Blind Source Separation 
algorithm is able to separate the two independent components of the mixed 
signal very well. After the separation, the neural network can distinguish 
between human speech and non-human sound with an accuracy of 98.6%. 
Consequently, our entire system can successfully output the human speech 
from a linear mixture of sound signals. 


Next Steps 


Implement the system in real-time. Our design currently works in the 
Matlab environment. It is implemented to deal with audio that is already 
recorded, but it does not deal with real-time audio streams. We eventually 
would like the system to work in real-time, separating the sources, forward 
speech, if any, and suppress other signals instantly so that people can 
benefit from this system. This implies improving the efficiency of the 
algorithms, and making use of Matlab SimuLink to explore the possibility 
of real-time implementation of our system. 


Additionally, we would like to explore the possibility to separate several 
sources in addition to human speech. As mentioned in the introduction, 
other signals from the surrounding might also be important, such as a car's 
horn. We can improve the separation algorithm and train the neural network 
so that the system can recognize not only human speech, but also other 
important signals from the environment. With this implementation, users 
can have the option to forward in the signals they want. In order to achieve 
that, one needs to increase the number of microphones to increase the 
number of observed signals, and re-train the neural network so that it 
recognizes different types of sound signals. 
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Selective Transparent Headphone : CCC 
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RICE 


Rice University Department of Electrical and Computer Engineering 


Objective 


Explore ways to build a selective transparent 
headphone that propagates speech signal and 
attenuates all other types of signals. 


Project Overview 


~ Since our objective is to blindly separate an input signal into its 
principaliindependent components, we will use independent 
‘component analysis to separate the signal into its components. 

- Then, we will utilize a neural network to select which output signal. if 
any, contains human speech, and forward it to the output. 


Independent Component Analysis 


y= input signals (sound mixtures) 
5 = original sources 
* ICA attempts to find the independent sources (individual sounds) 
that comprise an input signal by solving the following formulation 
for A’ 


ve=As 
» Ais the mixing matrix used to combine the original sources, s, to 
obtain the observed input signals, y- 


Blind Source Separation (BSS) 

* Transform M microphone inputs into frequency domain by taking 

the short-time Fourier transform (STFT). 

0.) =LY (0.0....X,,@0F 
Find subspace filter to reduce reflections and ambient sound. 
Wo) =A77EF 

* Find ICA filter matrix, U(@), for all sampled frequencies. 
* Separation filter: B(@)=U(a)W (a) 
* Scaling fitter:  87(@)=diagB> 


* Permutation fitter: Pp Deere, 
* Final fitter: F(@)= K@)B(@)B() 


For questions, please contact: 


Zichao Wanoince edu: Tianvi Yao@rice edie Stephen Xiai@rice edu; Sher 


» We used a three-layer neural network to distinguish between 
human speech and background noise. 

* More specifically, our neural network consists of an input layer, a 
hidden layer, and an output layer and uses the backpropagation 
algorithm and gradient descent to minimize the cost function. 


Figure 1: Mixed signal Figure 2: Mixed signal 
observed at 1** microphone observed at 2" microphone 
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Introduction to FFAST in Digital Multitone Communication 


Digital communication is a pervasive technology that many individuals, 
corporations, and governments rely upon to share information. Numerous 
researchers and engineers have focused their efforts on decreasing the time 
required for digital communication. One of the greatest such advancements 
is the Fast Fourier Transform (FFT), an algorithm [link] developed by 
James Cooley and John Tukey in 1965 that reduces the computational 
complexity of the DFT from to for appropriately 


chosen , where _ is the signal length. 


For many years, was believed to be the best achievable run- 
time complexity for computing a discrete time Fourier transform (DFT). 
However, in recent years, researchers have developed a class of algorithms 
known as Sparse FFT algorithms [link] which run in sublinear time with 
respect to. . These algorithms are able to have such a low run time 
complexity because they assume that the signal is sparse in its Fourier 
representation; that is, there are nonzero DFT coefficients where 


In this report, we present one such Sparse FFT algorithm, the Fast Fourier 
Aliasing-based Sparse Transform (FFAST)[link], as a method to 
significantly decrease computation time in a digital multitone (DMT) 
communication scheme because it can operate in . In Module 2, 
we present the DMT scheme and demonstrate signal sparsity. In Module 3, 
we outline the architecture of FFAST. In Module 4, we highlight several 
FFT methods used in our project. We describe our computational 
experiment and provide numerical results in Module 5 and discuss practical 
considerations and other future work in Module 6. 


Digital Multitone Modulation 


Digital Multitone Modulation 


We present the digital multitone modulation scheme and demonstrate its 
suitability for demodulation via FFAST. 


Digital Multitone Modulation Scheme 


Let a finite alphabet A = {a1, Q2,°** QA } be given, where each symbol 

a € A is associated with a unique sequence of B ordered bits, 

bB-1, eee bi, bo, where B = [logs JA] ] and 6; € {0, 1} for 

1 =0,---,B-—1. For example, let A be the set of all lowercase letters in 
the English alphabet and associate each letter with its order in the alphabet. 
In this case, the binary sequence ~01101' corresponds to the thirteenth letter, 


W 


“m : 


Generally speaking, Digital Multitone (DMT) Modulation is a “parallel 
communication scheme in which several carriers of different frequencies 
each carry narrowband signals simultaneously” [link]. These narrowband 
signals are usually sinusoids that encode the binary sequence associated 
with each symbol. If a bit is “high", then the corresponding sinusoid is 
expressed in the output signal; otherwise the bit is “low" and the sinusoid is 
not expressed. More precisely, given a symbol a € A with the 


corresponding binary sequence bg_1,---, 61, bo, the message signal m/(t) 
is defined to be 
Equation: 
l B-1 
m (t) = ———— © bgcos (20 (k + 1) fot) 


So Ob 0 


for some fundamental carrier frequency, fo. 


In our previous example where A is the English alphabet and the letter “m' 
corresponds to “01101”, the message signal m/(t) is the sum of the first, 
third, and fourth harmonics, as shown in the figure below: 


bebe O) 


05 
J _ 
4 = 
4 \J Piel) 31/4 


2 
T/4 


Decomposition of “m” in DMT Scheme 


In our computational experiments, we use digital multitone modulation to 
encode 8-bit Extended ASCII values. An Extended ASCII table can be 
found here. Below are several symbols and their digital multitone 


modulation signals. 


| 6 | 36 | 0011 0110 | 


Table of Extended ASCII Values 


04 04 
0.2 02 
0 0 
02 02 
0.4 0.4 
06 06 
i 08 08 
—~ ‘0 1 2 3 4 5 6 “o 1 2 3 4 5 6 
1 
= 08 08 | 
06 06 
04 04 
02 02 
0 0 
02 02 
04 0.4 
06 06 
08 08 
19 1 2 3 4 a : 2 3 4 5 6 


t (sec) 


Different Symbols in DMT Scheme 


Sparsity in Digital Multitone Modulation 


Sparse FFT algorithms only achieve low runtime complexity if the input 
signal is sparse in its Fourier representation; that is, if for a length-/NV signal, 
there are k nonzero DFT coefficients with k << N. FEAST, the sparse FFT 
algorithm that we will be using, requires the sparsity constraint k << N 3, 
Recall that the message signal, m(t), is defined as 

Equation: 


B-1 
oT ere beosan bh 
De, 0 OK %=0 


so that the Nyquist frequency is 2B fo. In order to ensure signal sparsity, the 
sampling frequency should be a multiple of the fundamental carrier 
frequency so that each of the sinusoidal components falls into a single 
frequency bin. This type of “on-the-grid" sampling may be expressed as 
Equation: 


fs =Nfo 


where JN is the length of the sampled signal. Note that in [link] each 
sinusoid contributes two DFT coefficients. Thus, k = 2B if all bits are high 


1 
so that the sparsity condition may be expressed as 2B < N32. 


Sampling plays a large role in signal sparsification. There are many 
sampling methods that ensure sparsity and we present two different 
methods. The first method involves padding the input signal to achieve 
sparsity. Consider sampling at the Nyquist frequency so that f, = 2Bfo. As 
stated, this sampled signal is not necessarily sparse — in fact, k = N if all 
bits are high! However, periodizing the sampled signal sufficiently many 
times will result in higher frequency resolution by placing zero-value 
coefficients in between the nonzero coefficients, thus sparsifying the signal. 
This method results in a spectrum with nonzero coefficient few and far 
between. Second, consider sampling at a sufficiently high rate to satisfy the 
sparsity condition; that is, f, > 8B° fo. First note that 8B? fy > 2Bfy so 
that aliasing does not occur. This method results in a compact spectrum 
where only the first and last B coefficients are nonzero. See the figure 
below for a spectra that are characteristic of these methods. 


|DFT| 


Coefficient Index 


Spectra for Different Sampling Schemes 


It should be noted that there are many sampling methods that ensure signal 
sparsity but for the purposes of this project, we care only that the signal is 
sparse. Sampling schemes are discussed further in [link]. 


Fast Fourier Aliasing-based Sparse Transform 
A brief description of the FFAST algorithm. 


Fast Fourier Aliasing-based Sparse Transform 


Fast Fourier Aliasing-based Sparse Transform (FFAST) is a sparse FFT 
algorithm developed by Sameer Pawar and Kannan Ramchandran in May 
2013 [link]. We present a formulation of the algorithm that is specialized 
for our digital multitone scheme. 


FFAST consists of three modules: the downsampling module, the FFT 
module, and the peeling module. See [link] for a diagram of this 
architecture. 


Downsampling Module FFT Module Peeling Module 
| 


FFAST Architecture 


Front End 


The downsampling module consists of three stages. In each stage 2, the 
signal and a delayed version are both downsampled by a sampling 
coefficient, n; € Z,. This introduces aliasing in the frequency domain, 
which will be a key component in the peeling module. It is necessary that 
the sampling coefficients n; are coprime factors of the singal length N; that 
is, Np Ny Ng = N where no, 1, Nz, are all relatively prime. 


These smaller subsignals are passed to the FFT module, which computes 
the DFT of each subsignal. Any FFT algorithm may be used in this stage 
with the condition that it works for a general signal length NV. The DFTs of 
the subsignals at stage 2 are then paired together. We denote such a pair as 


yi = (a; {l], £; [l]), where x; [1] and %; [I] are the 1;;, values of the DFT of 
the normal and delayed subsignals of stage 7, respectively. 


Back End 


The peeling module takes these smaller DFT pairs and backsolves a 
bipartite graph to obtain the DFT coefficients of the original signal. To 
understand the structure of this graph, recall that the aliasing caused by 
downsampling “mixes" frequency domain components. More precisely, the 
coefficients of the smaller DFTs are a linear combination of the original 
DFT coefficients. Consider a graph with two types of vertices: the smaller 


DFT pair coefficients y and original DFT coefficients Xp]. If an original 
DFT coefficient contributes to the value of a smaller DFT coefficient, an 
edge is placed between the two vertices. It is easy to see that this is a 
bipartite graph because the vertex set can be partitioned into smaller DFT 
coefficients and original DFT coefficients. 


We denote a smaller DFT coefficient vertex as a zero-ton if no nodes are 
connected to it, a singleton if exactly one node is connected to it, and a 
multi-ton if it is neither a zero-ton nor a singleton. 


Ifa vertex y! = (a; [I], Z; [l]) is a pair of zeros, then it is a zero-ton. 
Otherwise, to determine whether a vertex is a zero-ton, a singleton, or a 
multi-ton, the algorithm uses a “Ratio Test" [link]. Recall that a circular 
shift in the time domain is a multiplication by a complex exponential in the 


frequency domain so that we may use the values ing} to determine whether 
the vertex is a singleton. To perform this ratio test we may check if the 
quantity 

Equation: 


is an integer. If q is an integer, then the vertex in question is a singleton and 
thus, X [gq] = x; [1]; otherwise, the vertex in question is a multi-ton. 


We now describe the process of backsolving this bipartite graph to get the 
DFT coefficients of the original signal. If a vertex is a zero-ton, we may 
remove it from the graph because it provides no relevant information. If a 
vertex is a singleton, we have obtained a DFT coefficient X|q]. By the 


“mixing” process of aliasing, we know which smaller DFT pairs yi that 
X|q| contributes to. With this information, we may subtract X|q] from these 
smaller DFT pairs, thus removing edges from the graph. We repeat these 
steps until all edges are removed from the graph and X|q] is known 
completely. This process is known as peeling and is reminiscent of 
decoding Low Density Parity Check codes. 


Convergence Conditions 


In general, FFAST is a robust algorithm that can handle noise, many signal 
lengths N, and sparsity factors k, where k is the number of nonzero DFT 
coefficients. We presented a specific noiseless version of the algorithm that 
requires the sparsity constraint k < N US As previously mentioned, 
FFAST also requires that the subsampling coefficients are coprime factors 
of N. With these conditions, FFAST is guaranteed to converge to a solution 
almost surely. 


FFT Methods 
FFT Methods 


Self-Sorting Mixed-Radix FFT 


The mixed-radix approach utilizes clever subsampling of x|n] and 
permutations of twiddle factor matrices to lower operation counts. The first 
step is to compute the prime factorization of the signal length NV. Prime 
factors of 2 and 3 are then combined to create as many factors of 4 and 6 as 
possible. The resulting prime factorization has the form N = [es fi. We 
can then perform a single step T; of the algorithm by computing N/f; FFTs 
of length f;. 


Four unique matrix constructions are necessary to generalize a single step 
of the algorithm. The first is the identity matrix I, € IR"*”. The second is 
the permutation matrix P? € R%®*?, where 


Equation: 


— 1 ifj=ra+sandk=sb+r 
PE (is) = 


otherwise 


Note that P,* swaps positions ra + s and sb + r ina vector. The third 
matrix to consider is a diagonal matrix of twiddle factors D¢ € R®*®, 


where 
Equation: 


if7=k=sb4+r 


otherwise 


we" 


Di (ik) = |" 


with wap = e /27/(9>) The fourth and final matrix construction to consider 
is the standard DFT matrix W,, € R””*”. 


The mixed-radix algorithm requires the interaction of operations on the 
subspaces R‘, 1 = 1,..., F, and so it is necessary to consider the 
Kronecker Product in calculations. The Kronecker Product is an operation 
defined as ®@ : R™*” x R?*7 5 R™*"d_ If C = A @ B, then 
Equation: 


v=p(ii-1)+k 


C(v,w) =A(i,j)B(k,1) and 7 =q(j-1)+! 


Using these matrices, it possible to compose a single steps 7; using the 
following equation [link]: 
Equation: 


T; = (PiDi ® Ip.) (Wh, @ Im) 


where 
Equation: 
Pi = Vag (po — 1), 1=0,...,F 


qi = N/pi, 1=1,..., 
m= N/fi; 2 eee 


Note that T; € R™*%, 
To implement the algorithm, each stage is applied iteratively to acquire X. 


T;, may be simplified into two separate steps using the definitions of each 
matrix to yield the following algorithm structure: 


for i = 1:nf 
for a = 0: (q(i+1)-1) 
for b = 0:(p(i)-1) 

xsub = x(b+(0:£(i)-1)*m(i)+a*p(i) + 1); 

if £(i}) <= 6 
t = WinogradFFT(xsub) ; 

else 
t = RaderFFT(xsub) ; 

end 


for lambda = 0: (£(i)-1) 
xprime ( (a*f (i)+lambda)*p(i)+b +1) =... 
exp (-2*pi*li/q(i) *lambda*a) *t (lambda+1) ; 


Mixed Radix 


The DFT module in the above code utilizes the Winograd Short-Length 
Transforms to minimize operations when computing DFTs of length less 
than 6. For all other lengths, Rader's FFT is used. 


Rader's FFT 


Rader's FFT for prime lengths exploits results from number theory to 
express the DFT as a composite-length cyclic convolution [link]. Any prime 
number p defines a multiplicative group modulo p, denoted 

Zr, = {0,1,2,...,p —1}. This group is cyclic, ie., 

Va € Zp Ag € N?_» :a=g =g ?, 0<1,j < p — 2, where g is known 
as the primitive root modulo p and N?_. = {0,1,2,...,p —2}.In 
practice, there is no general formula for finding g for p, but in this 
implementation N is known and therefore g may be stored in a lookup 
table. 


The general form of the DFT of length p is given by 
Equation: 


x, = c,w* =F4+ SN a,w, k=1,2,...,p—1 


nk 
Pp 


and k range from 1, 2,...,p — 1, we can rewrite the above formula using g 


Since the twiddle factor, w?’”, is naturally computed modulo p and both n 


Equation: 


p-2 
ss —(r-4) 
X,+=%+) Gy, 5 THO Ly p 2 
q-0 


This formulation is of the form of a cyclic convolution of two sequences @ 
and 8 where ag = hg: and 6, = w ° of length p — 1. This convolution is 
computed via the convolution theorem: 

Equation: 


X —% = DFT! |[DFT{a] - DFT[4]] 


For speed, the sequence a is zero-padded between its zeroth and first index 
to a length M which is a power of 2 such that M > 2p — 3 and is 
cyclically repeated to be length M. Since N is known, it is possible to store 
the DFT of the sequence £ in a lookup table. The DFT of a is then 
computed using the standard radix-2 Cooley-Tukey algorithm. The two 
DFTs are multiplied pointwise and then the inverse DFT is again calculated 
using the standard Cooley-Tukey algorithm. The first p elements are then 
the result of the cyclic convolution. The final result is attained after adding 
back the DC offset. 


For small numerical examples demonstrating this implementation of 
Rader's FFT, see this excellent guide here. 


Computational Experiment of FFAST 
Computational Experiment 


Experiment Design 


We wanted to determine the most computationally efficient demodulation 
method for the digital multitone scheme. In our experiment, we compared 
the operation counts of an optimized FFT meta algorithm, a partial DFT 
computation (the DFT computed only for the nonzero coefficients), and 
FFAST in demodulating various message signals. We chose these 
demodulation methods because they are currently the most efficient 
methods for DMT demodulation. The experiments were run using 
MATLAB 2014a. We chose to run our experiment in MATLAB for its rapid 
prototyping environment. 


While we were interested in comparing the computational efficiency of 
these algorithms, we chose to record operation counts rather than run times. 
Run times are unreliable metrics on machines with multitasking operating 
systems, especially when using highly optimized programs like MATLAB. 
We chose to count complex additions and multiplications as one operation 
each. We did not count conditional statements as operations because in most 
general processors, they only require a single cycle. We counted complex 
exponentials and trigonometric functions as one operation because they 
may be implemented using lookup tables. 


Meta-FFT Algorithm Implementation 


The two main goals for our implementation of the optimized FFT meta 
algorithm were to i) create an algorithm that performs without the need to 
zero-pad the signal and ii) allows us to count operations. Well-behaved 
signal lengths in our implementation of the algorithm have the form of 

N = ngnjno, where n,; are coprimes. To exploit this structure, the 
optimized FFT meta algorithm, implemented in meta_fft .m, uses a self- 
sorting mixed-radix complex FFT [link]. For sub-transforms of length 


2 < N <6, short-length Winograd transforms are applied to conserve 
operation count. All other transforms are computed using Rader's FFT 
algorithm [link]. 


FFAST Implementation 


The FFAST algorithm implementation requires the signal itself, the length 
of the signal, and a vector of the downsampling coefficients. 


The FFAST algorithm was implemented in two files. The first file, 
ffast_front_end.m, downsamples the signal by each of the coprimes 
and feeds shifted and unshifted versions of the downsampling to the 
meta_fft.m file, to get operation counts and the relevant FFTs. Once the 
relevant DFT pairs are generated, Ffast_front_end.m calls the back 
end of the algorithm. 


The second file, peeling decoder .m, implements the peeling module 
of FFAST to backsolve the bipartite graph. The program will return a flag if 
the algorithm encounters no singletons at a stage where it has not been fully 
solved. 


Numerical Results 


In our first experiment, we varied the signal length N and observed the 
operation counts required for the optimized FFT meta algorithm, the partial 
DFT, and FFAST. We constructed each signal in the Fourier domain by 
randomly selecting & values from the set of integers {1,---, NV} and setting 
the corresponding DFT coefficients to a For each signal, we put k = N ue 
, which is the greatest allowed sparsity in our scheme. The choice of k and 
values of the k nonzero DFT coefficients are consistent with the DMT 
scheme. We used MATLAB's library function 1f Ft ( ) to compute the 
corresponding signal and counted the number of operations it took each 
algorithm to compute the DFT. We observed that the operation count 
required for FFAST was usually an order of magnitude less that the 


operation counts required for both the meta FFT algorithm and the partial 
DFT. See Fig [link] for the results of the first experiment. 
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Experiment 1 


In our second experiment, we varied the signal sparsity & and observed the 
operation count required for the optimized FFT meta algorithm, the partial 
DFT, and FFAST. For this experiment, signal length V = 8740 and the 
signals were constructed in the Fourier domain, as before. We observed an 
80% computational decrease from the optimized FFT to FFAST for all 


k< No. However, fork > N z, we observed signals for which FFAST did 
not converge. See Fig [link] for the results of the second experiment. 
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Experiment 2 


Note that the operation count for the partial DFT is less than the operation 
count for FFAST for k < 8; this is somewhat misleading because the way 
that the partial DFT is computed is slightly optimistic. The partial DFT 
computes only the nonzero coefficients, which are known a priori in this 
experiment. In the general framework of DMT, one would need to compute 
all possibly nonzero DFT coefficients, resulting in an operation count 
higher than that of FFAST. 


Discussion and Future Work 


Discussion and Future Work 


In any digital communication scheme there exist design parameters that are 
independent of the scheme itself: digital sampling rate and system baud 
rate. These parameters directly determine the values of N and B in the 
DMT communication scheme. It is therefore plausible to have many 
different values of N and B. 


From the results, the FFAST algorithm outperformed the mixed-radix FFT 


for all signals with lengths greater than 214, with the additional condition 


1 
that 2B < N3. Thus, if a communication scheme takes at least 2!4 


iL 
in the time it takes to send B = | $2043) | = 12 simultaneous bits, a 


samples 


sparse FFT will require fewer computations than the tested existing 
frequency domain schemes, reducing receiver bottlenecking, and will 
therefore be practical to use with some system designs. 


Ultimately, the best way to address the viability of the sparse FFT (and 
therefore expand on the goals of this paper) is to physically implement a 
communications system compatible with the algorithm itself. While this 
paper has attempted to address concerns about the possibility of 
implementation there are still further matters to consider before a physical 
interpretation of this algorithm can arise. 


The first and foremost matter to consider is that the version of the FFAST 
algorithm that we implemented only works when the signal is exactly 
sparse. Practically, the communications scheme would have to work with a 
noisy channel. A noisy version of the FFAST algorithm does exist [link], 
however, and should be tested to verify our results in a noisy case. 


Second, it would be useful to devise a more efficient communication 
scheme that takes into consideration the fact that the sparse FFT converges 
even though it does not“know” where the signal is not sparse. In our 
experiment, we allotted the first B “slots” of the frequency domain of our 
signal to the sinusoids, a way to guarantee that the frequency sparsity of our 


signal would not exceed 2B. This does not take into consideration that for 
any given WV there are 
Equation: 


Ss 2° 


wy 2 


different ways to have a sparse signal of density 2B. Finding a coherent 
way of organizing these different possibilities and using them will give 
transmitted signals a much higher density and also allow for a higher baud 
rate of the system (in the example above, B would be increased from 12 
bits to 127 simultaneous bits!). 


Ultimately, once these considerations are taken into account, a coherent 
Sparse Communication system seems much more plausible. 


The Team 


Christopher Harshaw implemented a GUI for the demo of the project, JJ 
Allred researched and coded the meta-FFT algorithm implementation, and 
42 Prieto coded the FFAST front and back ends. All members worked on 
the report and poster. You can download our poster as a pdf here. . Team 
Awesome, signing off. 


Introduction 


Introduction 

The process of speech recognition comes naturally to humans since the ear 
acts as an auditory system, giving us the gift of sound. The ear converts the 
mechanical vibration signal of sound into electric signals that the brain can 
identify. However, computers cannot translate signals naturally like the 
human ear so we have to come up with creative methods to determine what 
was Said. Speech recognition is complex, but we can simplify it if we only 
focus on recognizing vowels. Consonants generally have erratic frequency 
responses, but vowels are special in that their frequency responses have 
several well defined peaks which are called formants. Long story short, if 
we can figure out the formants of a vowel, we can figure out the 

vowel. Insert paragraph text here. 


Motivation 

Many big name companies like Microsoft, Apple, and Google are creating 
speech recognition programs. We decided that we wanted to make 
something useful and relevant to the outside world and build an interactive 
software. This way our project has more functionality and the user is not 
restricted when interfacing with the program. Building smooth 
communication between humans and computers has been a focus of 
engineering in the 21st century and we wanted to do our part in this 
incredible, ongoing effort. 


Objective 

Our team aims to take a small aspect of speech recognition, developing a 
vowel recognition algorithm and test its accuracy. Our initial objective is to 
use the Fourier Transform to extract an input vowel signal and have the 
output identify the vowel. Once that proved mostly successful, we moved 
on to a harder task: recognizing vowels at arbitrary spacing with arbitrary 
length. Namely, we challenged ourselves to find a way to interpret the 
vowels in an stream of speech. 


Vowel Recognition Overview 


Concept of Vowel Recognition 

Vowel recognition is an interesting challenge because it applies system 
theory to our own physiology. Whenever we attempt to speak, the glottis - 
the part of the larynx containing the vocal cords - starts to vibrate. These 
vibrations can be modeled as white noise which become audible and 
coherent sounds as they pass through the vocal tract. Past experiments have 
repeatedly confirmed that the vocal tract can be treated as a linear, time- 
invariant system. 


With this, we can proceed in one of three ways: we can model the system as 
having all zeros (moving average), all poles (autoregressive), or some 
combination of poles and zeros. Since we can only observe the output of the 
filter - the speech the escapes the vocal cavity - we choose to model the 
system as having only poles, because such a model has little dependence on 
the original input signal. With an autoregressive model, we can generate a 
transfer function to approximate the filter with a degree of precision 
proportional to the filter's parameter. It should be noted that a higher order 
model generally works better but is also more computationally expensive. 
Once we have the transfer function, we look at the frequency response and 
determine which frequencies the peaks occur. These frequencies are the 
formants and we can look at known formant charts to determine which 
vowel was spoken. 


Identifying Vowel Formants 

The term formant refers to peaks in the harmonic spectrum of a complex 
sound. You can see this spectrum by taking a Fourier transform and looking 
at the frequency response of the signal. Formants in the sound of the human 
voice are particularly important because they are essential components in 
the intelligibility of speech. The distinctness of the vowel sounds can be 
attributed to the differences in their first three formant frequencies. 
Producing different vowel sounds amounts to returning these formants 
within a general range of frequencies depending on the particular person. 
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Formants of an 'Ah' Sound (Adopted from 
HyperPhysics) 


From the frequency response of the vowel formants, we can look at how the 
peaks of the frequency in the harmonic spectrum line up with the 
corresponding dark lines in the spectrogram. The dark areas are where the 
formants are and the graph show them at the same frequency. 
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Spectrogram of the 'Ah' Sound 


The Autoregressive Model 
The Autoregressive Model 


Speech recognition also utilizes probability theory to understand the effects of the 
vocal tract system. Since there are many physiological factors that contribute to the 
buzzing of the glottis, the Central Limit Theorem allows us to model it, (x[n]), as 
an iid process with Gaussian distribution of mean zero and constant variance. 
Equation: 


u-0,0°=K 


Knowing that the signal value at any time t is some random variable, we naturally 
begin to wonder about the autocorrelation of the process, the measure of how 
correlated signal values at different times are. For independent processes, we know 
that the autocovariance (which is the same as the autocorrelation in this case) 
equals zero for any time t; with some other ¢2, and equals the variance at tg = f1. 
This is illustrated in the derivation below for the zero-mean case: 

Equation: 


C,=R, =E[z(t)x(t+7)] Vr 
Equation: 
for #0, becomes E/x(n)|E[z(n + 7)] = 0 
Equation: 
for t =0, becomes E [x (n)| = 0° 

With this in mind, we know that 

Equation: 

R, [k] = 076 [k] 

We now know the autocorrelation of the input and the autocorrelation of the output 
y|n] (the output is known - it's the speech signal). We define the DTFT of R, to be 


the power spectral density of x. That is, 
Equation: 
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We can show that for LTI/LSI systems, the power spectral density of the output 
becomes the power spectral density of the input multiplied by the square of the 
magnitude of the transfer function. Note that this holds only for wide-sense 
stationary processes - i.e., processes for which R, is only dependent on 7 and not 
on ¢. Starting with the convolution sum, 

Equation: 


y(n] = > x [n— klA[k) 
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Equation: 


R, |n| = E(y(n)y(n+7)| = £ ( 3 x |n— kh {k] % sintr— intl 
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Equation: 
. = by h [r]h [k]B (a [n — hla [n+ 7-1) 
Equation: 
= AUR (rk 7) 
Equation: 
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Letting u = 7 + k — 7, this becomes: 
Equation: 
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Equation: 
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A"(f) A(f) S2(f) 
Equation: 
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Equation: 
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So now we can solve for the transfer function. Note that the value of the variance 
does not matter, because it affects the magnitude of the graph but does not affect its 
shape. The system is depicted below. 


Equation: 
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If we use a linear, time invariant system with only poles, we can model the output 
as a recursive average as shown. 
Equation: 


yin] =aln]+ >) ayln—7 


Statisticians often refer to a model of the following form as an autoregressive 
process. Note that it has the same form as our equation above. 
Equation: 


Y=e+)> ay |n — 3] 
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They have algorithms which can find the values of that constant. Those constants 
are the coefficients of the polynomial in the denominator of the transfer function. 
This is essentially the autoregressive model we talked about earlier. The input is 
the numerator of our transfer function. Again, the actual value of the numerator 
doesn" ™t matter because it only changes the magnitude of the frequency response. 


Now that we know how to model the transfer function, we can tackle the formants. 


One Vowel Recognition 


After we gathered information about probabilistic interpretations of speech 
recognition, we were ready to begin building the project. Starting with the 
basics, we built a simple program that would take as input a three-second 
long signal and produce as output a measure of the corresponding formants 
of the signal. By applying the auto-regressive model, we were able to come 


up with a reasonable transfer function modeling the speech. However, we 
encountered a slight problem. 


If we estimated the transfer function using an auto-regressive model of low 
order, the associated frequency response would be a little too smooth. L.e., it 
would perhaps mask some of the peaks of the graph and cost us valuable 
information in determining the formants. On the other hand, if we used a 
high-order model, we would get too much variation in the frequency 
response because of the intricacies of the high-order rational function. We 
would not be able to tell which peaks were true formants and which peaks 
were just "overestimation" by our model. This is demonstrated in the figure 
below for the test speech signal "hat.mat”. 
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Low vs High Order of the Autoregressive Model 


This eventually started to look like an optimization problem, and since none 
of us had much relevant expertise, we approached it slightly differently. 
Instead of solving a case-by-case parameter optimization problem with 
unnecessarily difficult math, we simply abstracted the problem in terms of 
something we knew - polynomial regression. 


The Taylor series tells us that we can approximate every function as a 
polynomial, whose order determines its degree of precision. As with the 
parameter for the AR model, increasing the polynomial approximation's 
order would result in a high degree of information saved from the signal. 
Since we were more familiar with Taylor series than machine learning and 
Statistical approximation, we chose to rely more strongly on polynomial 
regression than the AR model. 


With this in mind, we used MATLAB to solve a generalized least-squares 
regression problem. In other words, we came up with a solution that is both 
accurate and precise. We used a high-order (30 order) a.r. model to get the 
accuracy we needed and a moderate-order(120 order) polynomial 
regression to get the precision we needed. This is illustrated in figure below 
for "hat.mat”. 
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Formant Determination of the Signal when Sujay is Saying the Word 
'Hat' 


After using predetermined average vowel formants found on the internet to 
estimate the formants of the signal, we then implemented a matched-filter 
which would find the vowel whose theoretical formant values were most 
similar. Our filter was based on the method of least-squares, so it chose the 
vowel for which the mean squared difference was smallest. We plotted the 
measures of the mean-squared differences for hat.mat below. Note that the 
lower the bar, the more similar the match between the corresponding vowel 
and the signal. 
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Mean Squared Difference Equation 
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Measured Mean-Squared Differences of the Signal 
when Sujay is Saying the Word 'Hat' 


Multiple Vowel Recognition Analysis 


The Thought Process 


Initial Considerations 

The program we initially developed was capable of accepting one vowel 
sound over a course of one or two seconds. Naturally, we wished to expand 
our program capacity to work for sequences of vowels with variable 
separation in time. With the increasing complexity of our project, however, 
we faced even more situations and problems we had to consider. The first 
and most important problem we had to consider was whether or not to parse 
the data into sections where we would "guess" the location of the vowel. 
The next problem, closely connected with the first, was a method of 
differentiating noise from actual sound content. A last problem was 
implementing the method optimally into MATLAB. 


For reasons that will become clear soon, we chose to compute vowels by 
creating a continuous, nonoverlapping partition of the signal instead of just 
separating out vowel content. 


The Math 

Consider the sample signal below. The actual vowel content is contained 
within samples 2000-3500. White Gaussian noise has been added between 
samples ~1000-4500. This superposition represents a combination of both 
external noise infiltrating the signal and unwanted content in the speech 
itself. To clarify that last part, consider the following example. In 
determining the vowel in "hat," we must find a way to deal with 1) external 
noise, 2) the consonants 'h' and 't’, and 3) the transition between a consonant 
and the vowel and vice-versa. 


Sample signal 


i) 
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Sample Signal of the Word 'Hat' with Problematic 
Noise 


If we go the path of taking the chunk between samples 1000 and 4000 and 
guessing it's the vowel (which is the best we can do in the "parsing" case), 
we take into consideration a lot of unwanted noise that will affect our 
formant guesses. Since roughly half the power in this signal is noise, our 
results may be corrupted. How might we get around this? 


We looked to a partitioning method instead, because it seemed noise- 
effective to a higher degree. Consider taking the entire signal and dividing it 
into chunks of 500 samples a piece. 


Let's say that we iterate through the sequence of chunks and put the 
following restriction on our method: "Whenever we see that four chunks of 
our signal match to the same vowel's formants, we say that vowel is 
definitively in the signal." If we proceed this way, we notice that samples 
from 1000 to 2000 will yield two formant pairs that *might* be similar, 


while the samples between 2000 and 4000 yield four very similar formant 
pairs because of the vowel content. Since the formants identified in the 
initial noise don't satisfy the initial restriction, we throw them out. Since the 
formants identified in the signal itself do satisfy the restriction, we keep 
them as a means to identifying the vowel. While time-"parsing" can't really 
deal with the complication of noise, it is easy to see that with this new 
"partition" method, we are effectively filtering out the effects of 
inconsistent noise (both external and internal) by throwing out any 
unreliable data. 


This helps get rid of some of the potential inaccuracy of our program, but 
how might we deal with inevitable background noise? We determined 
experimentally that the mean (absolute value of the) amplitude of any 
partition containing the speech content is orders of magnitude higher than 
parts that are just background noise. With that, we set two conditions that 
would automatically exclude any such noise from being in our 
consideration. As we analyze the signal in progressive chunks, we first look 
to see if the chunk has a max amplitude of at least 0.1 of the max amplitude 
of the signal. Then, we check to see that the mean of the magnitude of the 
chunk was greater than or equal to the mean of the signal. If one or more of 
these conditions were not met, we would immediately discard the current 
chunk and move on to the next chunk of signal. 


The Application 


We built the prototype by dividing our problem into five subproblems: 
initializing our data, reading an audio signal, determining the frequency 
response, cleaning up formant data and displaying information relevant to 
the vowels. 


Initial Data 

In our function initialize_all_data, we store the theoretical vowel formant 
pairs for the words "see", "play", "hat", "palm" and "rug" into a matrix for 
later use. We also set up a convenient matrix called the output_matrix, 
created so that we could elude the need to go through 7 if-else statements 


when outputting a vowel. 


Audio Input 

After the user decides how long he or she wants to speak, we use a built-in 
MATLAB script to read in an audio signal of that length from the user. The 
sampling rate of 44.1kHz is intentionally high so that we can preserve a lot 
of information from the original signal. 


Assumptions 

1) The user will speak at a relatively constant amplitude. Note that this is a 
little easier said than done, because words like "see" are naturally much 
quieter than "hat." 


2) The user will speak slowly and clearly so that we can look for 
consistency in determining the vowels said. 


The Algorithm 
1) Establish a set of variables that store theoretical vowel formants. 


2) Record an audio signal of n seconds from the user. 
3) Split the data into non-overlapping chunks of 4000 samples. 


4) Preprocess the data by detrending with a low-order polynomial and using 
a low-order Butterworth lowpass filter. 


5) Determine the transfer function of the vocal tract associated with the 
current chunk using an ar model. 


6) Determine the peaks of the transfer function (the formants) and match 
with the closest formant, filtering by a least means-squared matched filter. 


7) If amplitude of signal exceeds 0.1 max amplitude of overall signal, we 
assume it's potentially a vowel. Put its formants onto the "stack" of 
recent_formant_pairs. 


8) If the last four formant pairs have been consistent (the same value), then 
we will assume the vowel estimation has worked. Add the formant pair to 
the "stack" of vowels identified in tracktimes_and_formants. 


9) Now we have processed our guesses for what vowels were said when. 
We are going to have repeats, so run through a for-loop to clean up the 
track_times_and_formants vector into a new, workable vector called 
track_begin_end_formants. 


10) Output data and guesses. 


Limitations 

1) Requires very clear enunciation. Difficult to establish because we 
naturally change the pitch of our words as we start and end them. Example: 
the end of the vowel in "strut" sounds remarkably similar to "hat" because 
of the way you shape your vocal tract as you close your mouth. 


2) User has to say each vowel for a moderate duration of time. Too long, 
and a wavering voice would affect success of the program. Too short, and 
not enough data to assign formants. 


3) This program is calibrated with formant values averaged across all ages 
and genders. If a male has an exceptionally deep voice, or a child/female 
has an exceptionally high voice, they may not be able to get accurate vowel 
readings from the program. Accents may also affect accuracy of results. 


Data Collection and Analysis 


Procedure 

When we began testing of our program, we knew that we would have to be 
very clear in conveying its strengths and limitations from the start. 
Otherwise, users might be tempted to rush and speak incoherently in the 
excitement to see if the program worked. With this in mind, we gave the 
following set of instructions to every person who tested our program. First, 
they needed to speak clearly and slowly, extending their vowels just a little 
beyond their usual length. Next, they were allowed to add pauses of any gap 
between the words as long as the pause was noticeable. Most importantly, 
any consonants they pronounced had to be soft, not hard. In the word "rug," 
for example, the hard 'g' gets a lot of emphasis and ends up overshadowing 
the rest of the word (including the vowel). That's why we told them to shift 
their normal pronunciation of "rug" to a softer "ru-ck." After these 
precautions were given, the user would say some combination of one- 
syllable words similar to the ones in our predefined database, waiting to see 
that the program might recognize this aspect of his voice. 


Results 

Our program was very successful, ranging from nearly perfect with shorter 
one-to-three word speeches, to moderately accurate with four-to-six word 
speeches. To see just how effective the program is, let's run through its 
capabilities with a test signal. Click here for a sample of Sujay speaking the 
words "see", "hat", "play", "head", "palm" and "rug". Since we anticipated 
handling both live speech and stored audio files, we run our program with 
the name of the file as a parameter and wait. Shortly, MATLAB produces 
the following output. 


The program has finished analyzing your input signal. It has deter- 
mined the following information about the words you said. 


According to the frequency spectrum of your input signal: 
...the vowel you said at roughly time t1 = 0.90705 seconds to time t2 
= 1.3606 is the central vowel of the word ”see” 


...the vowel you said at roughly time tl = 2.2676 seconds to time t2 = 
2.9025 is the central vowel of the word “hat” 


...the vowel you said at roughly time t1 = 4.1724 seconds to time t2 = 
4.7166 is the central vowel of the word “play” 


...the vowel you said at roughly time t1 = 5.6236 seconds to time t2 = 
6.3492 is the central vowel of the word “head” 


...the vowel you said at roughly time tl = 7.347 seconds to time t2 
= 7.8005 is the central vowel of the word “palm” 


...the vowel you said at roughly time tl = 8.7075 seconds to time t2 = 
9.4331 is the central vowel of the word ”rug” 


At the very least, this tells us that the program can recognize the vowels 
chronologically. But how can we ensure that the program identified the 
correct time ranges for the vowels? Along with the text shown, the program 
also outputs a graph showing where in time it estimates sound content in the 
original signal. It is shown in the figure below. Interestingly enough, by 
avoiding the "parsing" approach to this problem, we managed to do an even 
better job of parsing the signal into its vowels. 
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And, for good measure, a total count of each vowel is displayed to visualize 
the program output for longer strings of words. 


Total count of each vowel 


see play hat palm rug head 


More Data 
A few more samples of data are shown below for demonstrative purposes. 


Figure 9: Nathan Bucki says *Hat Cat Rug Thug” 
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...the vowel you said at roughly time t1 = 1.2699 seconds to time t2 = 1.4513 


is the central vowel of the word “hat” 


...the vowel you said at roughly time t1 = 2.9932 seconds to time t2 = 3.1746 


is the central vowel of the word “hat” 


...the vowel you said at roughly time t1 = 4.898 seconds to time t2 = 5.0794 is 


the central vowel of the word "rug” 


...the vowel you said at roughly time t1 = 6.9841 second to time t2 = 7.2563 is 


the central vowel of the word ”rug” 


Figure 10: 


Dylan Jones says *Bee Baa” (pronounced like bat)” 
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...the vowel you said at roughly time t1 = 0.9075 seconds to time t2 = 1.2699 
is the central vowel of the word ”see” 
...the vowel you said at roughly time t1 = 3.6281 seconds to time t2 = 4.5352 
is the central vowel of the word “hat” 


Data Analysis 


We found that our project was generally successful, but our data collection 
showed us some flaws that we did not anticipate. First and foremost, the 
natural tonal variation in pronunciation made it difficult to determine vowel 
content. Typically when one says the word "see," they start off with a higher 
pitch and end with a lower pitch. Since our program works best for 
monotonous vowel sounds - that is, vowels whose sounds do not vary from 
one time to the other - it was challenging to succeed without first warning 
the user to maintain monotone in the course of every vowel. Second, we 
found that however hard we tried to throw out the effects of consonants on 
our program, they still appeared, albeit to a small degree. In the formation 
of the a word beginning with "h", for example, one starts off by slightly 
parting the lips and blowing air before proceeding to say the rest of the 
word. If the person is trying to say "hat," he or she would maintain his 
output of air while stretching his mouth and tightening his vocal cords. If he 
or she is trying to say "head," he or she opens his or her mouth vertically 
while applying similar pressure to his vocal cords. This phenomenon of 
speech is known as coarticulation. Note that although the end result - the 
vowel - is different from "head" to "hat", the process of getting there is very 
similar. That's why our program sometimes mixed up words that start the 
same, like "head" and "hat", and words that end the same, like "hat" and 
"strut". Last but not least, we found that the program did not work as well 
for people with accents or exceptionally deep/high voices. This was a little 
frustrating, but not unexpected, because even Google and other big 
companies struggle with this issue regarding its audience. 


Instructions for Usage 
1. Download the file here and unzip the files to a directory of your choice. 


2. Modify lines 34 and line 41 in main\_program.m to reflect the desired 
directory for data and the current directory, respectively. 


3. Open MATLAB and change MATLAB's working directory to wherever 
you unzipped the files. If you wish to test the program with live audio, run 
main_program and follow the instructions. If you wish to test the program 
with prerecorded audio, run Elec301spectrogram_loadfile(parameter), 
where parameter is the name of your matlab .m audio file in single quotes 
and without the extension. The audio file must be contained in the same 
folder as the program's files. 


4. Follow the on-screen instructions. Make sure to speak slowly and clearly, 
with monotonous vowels and distinct pauses between each word. 


5. Let the program work its magic! For loaded audio, the full output will be 
displayed on the console. For live audio, the full output will be available as 
a pdf file in a child of the data directory that is unique to your run of the 
program! 


Future Work 


In the future, we might hope to be able to implement some more effective 
ways to get rid of consonant effects in testing this program. Or better yet - 
we could figure out how to process the consonants and develop a guess of 
what those might be as well! An interesting idea for the current status of the 
project would be to implement a Markov chain relating a sentence to the 
vowel flow in that sentence. That way, if a user spoke a short phrase, we 
could make a guess as to what sentence he said! 


Contributions 


Sujay: Developed and implemented the algorithm to recognize vowels in a 
continuous-stream in MATLAB. Wrote about the thought process, coding 
and data analysis. 


Mauro: Researched and implemented the autoregressive model in 
MATLAB. Wrote about the probabilistic model. 


Steven: Researched and implemented functions in MATLAB that were 
useful for data gathering and analysis. Wrote about the vocal tract system, 
limitations, and theory of formants. 


Project Poster 
https://cnx.org/content/m52128/ 


Conclusion 


Speech recognition is an important goal of engineering worldwide, but to 
enable the broader study of speech processing as a whole, we have to study 
it in chunks. For this project, we focused on vowel recognition, studying the 
theory behind formants and random processes that would enable us to build 
a highly successful program. Our initial development allowed us to 
recognize one vowel at a time, but upon further inquiry, we were able to 
implement an intelligent vowel recognition algorithm for streams of 
isolated words. 
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